0.1 Preface
0.1.1 Course Concept
Objective:
The course aims at giving students a solid (and often somewhat theoretically oriented) foundation in the basic concepts and practices of artificial intelligence. The course will predominantly cover symbolic AI – also sometimes called “good old-fashioned AI” (GOFAI) – in the first semester and offers the very foundations of statistical approaches in the second. Indeed, a full account of subsymbolic, machine-learning-based AI deserves its own specialization courses and needs much more mathematical prerequisites than we can assume in this course.
Context: The course “Artificial Intelligence” (AI 1 & 2) at FAU Erlangen is a two-semester
course in the “Wahlpflichtbereich” (specialization phase) in semesters 5/6 of the Bachelor program
“Computer Science” at FAU Erlangen. It is also available as a (somewhat remedial) course in the
“Vertiefungsmodul Künstliche Intelligenz” in the Computer Science Master’s program.
Prerequisites: AI-1 & 2 builds on the mandatory courses in the FAU Bachelor’s program, in particular the course “Grundlagen der Logik in der Informatik” [Glo], which already covers a lot of the materials usually presented in the “knowledge and reasoning” part of an introductory AI course. The AI-1 & 2 course also minimizes overlap with that course.
The course is relatively elementary; we expect that any student who attended the mandatory CS courses at FAU Erlangen can follow it.
Open to external students:
Other Bachelor programs are increasingly co-opting the course as a specialization option. There is no inherent restriction to computer science students in this course. Students with other study biographies – e.g. students from other Bachelor programs or external Master’s students – should be able to pick up the prerequisites when needed.
0.1.4 Acknowledgments
Materials: Most of the materials in this course are based on Russell/Norvig’s book “Artificial Intelligence — A Modern Approach” (AIMA [RN95]). Even the slides are based on a LaTeX-based slide set, but heavily edited. The section on search algorithms is based on materials obtained from Bernhard Beckert (then Uni Koblenz), which are in turn based on AIMA. Some extensions have been inspired by an AI course by Jörg Hoffmann and Wolfgang Wahlster at Saarland University in 2016. Finally, Dennis Müller suggested and supplied some extensions on AGI. Florian Rabe, Max Rapp and Katja Berčič have carefully re-read the text and pointed out problems.
All course materials have been restructured and semantically annotated in the sTeX format, so that we can base additional semantic services on them.
AI Students: The following students have submitted corrections and suggestions to this and
earlier versions of the notes: Rares Ambrus, Ioan Sucan, Yashodan Nevatia, Dennis Müller, Si-
mon Rainer, Demian Vöhringer, Lorenz Gorse, Philipp Reger, Benedikt Lorch, Maximilian Lösch,
Luca Reeb, Marius Frinken, Peter Eichinger, Oskar Herrmann, Daniel Höfer, Stephan Mattejat,
Matthias Sonntag, Jan Urfei, Tanja Würsching, Adrian Kretschmer, Tobias Schmidt, Maxim On-
ciul, Armin Roth, Liam Corona, Tobias Völk, Lena Voigt, Yinan Shao, Michael Girstl, Matthias
Vietz, Anatoliy Cherepantsev, Stefan Musevski, Matthias Lobenhofer, Philipp Kaludercic, Di-
warkara Reddy, Martin Helmke, Stefan Müller, Dominik Mehlich, Paul Martini, Vishwang Dave,
Arthur Miehlich, Christian Schabesberger, Vishaal Saravanan, Simon Heilig, Michelle Fribrance.
Here is the syllabus of the last academic year for reference; the current year should be similar. Note that the video sections carry a link: you can just click them to get to the first video and then go to the next video via the FAU.tv interface.
Recorded Syllabus Winter Semester 2021/22:
Administrativa
We will now go through the ground rules for the course. This is a kind of a social contract
between the instructor and the students. Both have to keep their side of the deal to make learning
as efficient and painless as possible.
Now we come to a topic that is always interesting to the students: the grading scheme.
Assessment, Grades
Academic Assessment: 90 minutes exam directly after courses end (∼ July 25
2023)
Retake Exam: 90 min exam directly after courses end the following semester (∼ . . . )
I basically do not have a choice in the grading scheme, as it is essentially the only one consistent with
university/state policies. For instance, I would like to give you more incentives for the homework
assignments – which would also mitigate the risk of having a bad day in the exam. Also, graded
quizzes would help you prepare for the lectures and thus let you get more out of them, but that
is also impossible.
Double Jeopardy : Homeworks only give 10% bonus points for the
exam, but without trying you are unlikely to pass the exam.
Admin: To keep things running smoothly
Homeworks will be posted on StudOn.
Sign up for AI-2 under https://www.studon.fau.de/crs4419186.html.
Homeworks are handed in electronically there. (plain text, program files, PDF)
Go to the tutorials, discuss with your TA! (they are there for you!)
Homework Discipline:
Start early! (many assignments need more than one evening’s work)
Don’t start by sitting at a blank screen (talking & study group help)
Humans will be trying to understand the text/code/math when grading it.
It is very well-established experience that without doing the homework assignments (or something
similar) on your own, you will not master the concepts, you will not even be able to ask sensible
questions, and take nothing home from the course. Just sitting in the course and nodding is not
enough! If you have questions please make sure you discuss them with the instructor, the teaching
assistants, or your fellow students. There are three sensible venues for such discussions: online in
the lecture, in the tutorials, which we discuss now, or in the course forum – see below. Finally, it
is always a very good idea to form study groups with your friends.
Approach: Weekly tutorials and homework assignments (first one in week two)
Goal 1: Reinforce what was taught in class. (you need practice)
Goal 2: Allow you to ask any question you have in a protected environment.
Instructor/Lead TA:
Florian Rabe (KWARC Postdoc)
Room: 11.137 @ Händler building, florian.rabe@fau.de
Tutorials: one each taught by Florian Rabe, . . .
One special case of academic rules that affects students is the question of cheating, which we will
cover next.
There is no need to cheat in this course!! (hard work will usually do)
Note: Cheating prevents you from learning (you are cutting into your own flesh)
We expect you to know what is useful collaboration and what is cheating.
You have to hand in your own original code/text/math for all assignments
You may discuss your homework assignments with others, but if doing so impairs
your ability to write truly original code/text/math, you will be cheating
Copying from peers, books or the Internet is plagiarism unless properly attributed
(even if you change most of the actual words)
I am aware that there may have been different standards about this at your previous
university! (these are the ground rules here)
There are data mining tools that monitor the originality of text/code.
Procedure: If we catch you at cheating. . . (correction: if we suspect cheating)
We will confront you with the allegation and impose a grade sanction.
If you have a reasonable explanation we lift that. (you have to convince us)
Note: Both active (copying from others) and passive cheating (allowing others to
copy) are penalized equally.
We are fully aware that the border between cheating and useful and legitimate collaboration is
difficult to find and will depend on the special case. Therefore it is very difficult to put this into
firm rules. We expect you to develop a firm intuition about behavior with integrity over the course of your stay at FAU. Do use the opportunity to discuss the AI-2 topics with others. After all, one
of the non-trivial skills you want to learn in the course is how to talk about Artificial Intelligence
topics. And that takes practice, practice, and practice.
Due to the current AI hype, the course Artificial Intelligence is very popular and thus many degree programs at FAU have adopted it for their curricula. Sometimes the course setup that fits the CS program does not fit the others very well; therefore there are some special conditions I want to state here.
In “Wirtschafts-Informatik” you can only take AI-1 and AI-2 together in the “Wahlpflicht-
bereich”.
ECTS credits need to be divisible by five ⇝ 7.5 + 7.5 = 15.
I can only warn of what I am aware, so if your degree program lets you jump through extra hoops,
please tell me and then I can mention them here.
Chapter 2
Format of the AI Course/Lecturing
We now come to the organization of the AI lectures this semester; this is really still part of the admin, but important enough to warrant its own chapter. First let me state the obvious, but there is an important point I want to make.
That being said – I know that it sounds quite idealistic – can I do something to help you along in this? Let me digress on lecturing styles; take the following with “cum kilo salis”, I want to make a point here, not bad-mouth my colleagues!
One person talks to 50+ students who just listen and take notes
The “I have a book that you do not have” style makes it hard to stay awake
We know how to keep large audiences engaged and motivated (even televised)
But the topic is different (AI-2 is arguably more complex than Sports/Media)
We’re not gonna be able to go all the way to TV entertainment (“AI-2 total”)
But I am going to (try to) incorporate some elements . . .
I will use interactive elements I call “questionnaires”. Here is one example to give you an idea
of what is coming.
Questionnaire
Question: How many scientific articles (6-page double-column “papers”) were
submitted to the 2020 International Joint Conference on Artificial Intelligence (IJ-
CAI’20; online in Montreal)?
a) 7? (6 accepted for publication)
b) 811? (205 accepted for publication)
c) 1996? (575 accepted for publication)
d) 4717? (592 accepted for publication)
Answer: (d) is correct. ((c) is for IJCAI’15 . . . )
One of the reasons why I like the questionnaire format is that it is a small instance of a question-
answer game that is much more effective in inducing learning – recall that learning happens in
the head of the student, no matter what the instructor tries to do – than frontal lectures. In
fact Sokrates – the grand old man of didactics – is said to have taught his students exclusively
by asking leading questions. His style coined the name of the teaching style “Socratic Dialogue”,
which unfortunately does not scale to a class of 100+ students.
Unfortunately, this idea of adding questionnaires is limited by a simple fact of life. Good questionnaires require good ideas, which are hard to come by; in particular for AI-2, I do not have many. But maybe you – the students – can help.
Resources
But what if you are not in a lecture or tutorial and want to find out more about the AI-2 topics?
Next we come to a special project that is going on in parallel to teaching the course. I am using the course materials as a research object as well. This gives you an additional resource, but may affect the shape of the course materials (which now serve a double purpose). Of course I can use all the help on the research project I can get, so please give me feedback, report errors and shortcomings, and suggest improvements.
FAU has issued a very insightful guide on using lecture recordings. It is a good idea to heed these
recommendations, even if they seem annoying at first.
Using lecture recordings:
Attend lectures.
Catch up.
We start the course by giving an overview of (the problems, methods, and issues of ) Artificial
Intelligence, and what has been achieved so far.
Naturally, this will dwell mostly on philosophical aspects – we will try to understand what the important issues might be, what questions we should even be asking, what the most important avenues of attack may be, and where AI research is being carried out.
In particular the discussion will be very non-technical – we have very little basis to discuss tech-
nicalities yet. But stay with me, this will drastically change very soon. A Video Nugget covering
the introduction of this chapter can be found at https://fau.tv/clip/id/21467.
Maybe we can get around the problems of defining “what Artificial intelligence is”, by just de-
scribing the necessary components of AI (and how they interact). Let’s have a try to see whether
that is more informative.
The components of Artificial Intelligence are quite daunting, and none of them are fully un-
derstood, much less achieved artificially. But for some tasks we can get by with much less. And
indeed that is what the field of Artificial Intelligence does in practice – but keeps the lofty ideal
around. This practice of “trying to achieve AI in selected and restricted domains” (cf. the discus-
sion starting with slide 27) has borne rich fruits: systems that meet or exceed human capabilities
in such areas. Such systems are in common use in many domains of application.
in outer space: in outer space, systems need autonomous control; remote control is impossible due to time lag.
in artificial limbs: the user controls the prosthesis via existing nerves, and can e.g. grip a sheet of paper.
in household appliances: the iRobot Roomba vacuums, mops, and sweeps in corners, . . . , parks, charges, and discharges; general robotic household help is on the horizon.
in hospitals: in the USA 90% of the prostate operations are carried out by RoboDoc; Paro is a cuddly robot that eases solitude in nursing homes.
The AI Conundrum
Observation: Reserving the term “Artificial Intelligence” has been quite a land
grab!
But: researchers at the Dartmouth Conference (1956) really thought they would solve/reach AI in two/three decades.
Consequence: AI still asks the big questions.
Another Consequence: AI as a field is an incubator for many innovative tech-
nologies.
Still Consequence: AI research was alternatingly flooded with money and cut off
brutally.
As a consequence, the field of Artificial Intelligence (AI) is an engineering field at the intersec-
tion of computer science (logic, programming, applied statistics), cognitive science (psychology,
neuroscience), philosophy (can machines think, what does that mean?), linguistics (natural lan-
guage understanding), and mechatronics (robot hardware, sensors).
Subsymbolic AI and in particular machine learning is currently hyped to such an extent, that
many people take it to be synonymous with “Artificial Intelligence”. It is one of the goals of this
course to show students that this is a very impoverished view.
We can classify the AI approaches by their coverage and the analysis depth (they
are complementary)
We combine the topics in this way in this course, not only because this reproduces the historical
development but also as the methods of statistical and subsymbolic AI share a common basis.
It is important to notice that all approaches to AI have their application domains and strong points. We will now see that exactly the two areas where symbolic AI and statistical/subsymbolic AI have their respective fortes correspond to natural application areas.
[Figure: precision/coverage diagram; producer tasks sit at the 100% precision end.]
General Rule: Subsymbolic AI is well suited for consumer tasks, while symbolic
AI is better suited for producer tasks.
An example of a producer task – indeed this is where the name comes from – is the case of a machine tool manufacturer T, which produces digitally programmed machine tools worth multiple million Euro and sells them into dozens of countries. Thus T must also produce comprehensive machine operation manuals, a non-trivial undertaking, since no two machines are identical and they must be translated into many languages, leading to hundreds of documents. As those manuals share a lot of semantic content, their management should be supported by AI techniques. It is critical that these methods maintain a high precision, since operation errors can easily lead to very costly machine damage and loss of production. On the other hand, the domain of these manuals is quite restricted: a machine tool has only a couple of hundred components that can be described by only a couple of thousand attributes.
Indeed companies like T employ high-precision AI techniques like the ones we will cover in this
course successfully; they are just not so much in the public eye as the consumer tasks.
One can usually defuse public worries about “is AI going to take control over the world” by just
explaining the difference between strong AI and weak AI clearly.
I would like to add a few words on AGI that – if you adopt them; they are not universally accepted – will strengthen the arguments differentiating between strong and weak AI.
I want to conclude this section with an overview over the recent protagonists – both personal and
institutional – of AGI.
Planning Frameworks
Planning Algorithms
Planning and Acting in the real world
Observation: The ability to represent knowledge about the world and to draw
logical inferences is one of the central components of intelligent behavior.
Thus: reasoning components of some form are at the heart of many AI systems.
Research in the KWARC group ranges over a variety of topics, which range from foundations
of mathematics to relatively applied web information systems. I will try to organize them into
three pillars here.
For all of these areas, we are looking for bright and motivated students to work with us. This
can take various forms, theses, internships, and paid student assistantships.
Sciences like physics or geology, and engineering, need high-powered equipment to perform measurements or experiments. Computer science, and in particular the KWARC group, needs high-powered human brains to build systems and conduct thought experiments.
The KWARC group may not always have as much funding as other AI research groups, but we are very dedicated to giving the best possible research guidance to the students we supervise. So if this appeals to you, please come by and talk to us.
Part I
This part of the course notes sets the stage for the technical parts of the course by establishing a common framework (Rational Agents) that gives context and ties together the various methods discussed in the course.
After having seen what AI can do and where AI is being employed today (see chapter 4), we will
now
Logic Programming
We will now learn a new programming paradigm: logic programming, which is one of the most
influential paradigms in AI. We are going to study ProLog (the oldest and most widely used) as a
concrete example of ideas behind logic programming and use it for our homeworks in this course.
As ProLog is a representative of a programming paradigm that is new to most students, programming will feel weird and tedious at first. But setting aside the unusual syntax and program organization, logic programming really only amounts to recursive programming, just as in functional programming (the other declarative programming paradigm). So the usual advice applies: keep staring at it and practice on easy examples until the pain goes away.
Logic programming is a programming paradigm that differs from functional and imperative
programming in the basic procedural intuition. Instead of transforming the state of the memory
by issuing instructions (as in imperative programming), or computing the value of a function on
some arguments, logic programming interprets the program as a body of knowledge about the
respective situation, which can be queried for consequences.
This is actually a very natural conception of a program; after all we usually run (imperative or functional) programs if we want some question answered. Video Nuggets covering this section can be found at https://fau.tv/clip/id/21752 and https://fau.tv/clip/id/21753.
Logic Programming
Idea: Use logic as a programming language!
We state what we know about a problem (the program) and then ask for results
(what the program would compute).
Example 5.1.1.
How to achieve this? Restrict a logic calculus sufficiently that it can be used as
computational procedure.
Remark: This idea leads to a totally new programming paradigm: logic programming.
We now formally define the language of ProLog, starting off the atomic building blocks.
The first three lines are ProLog facts and the last a rule.
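For instance, a program of this shape (three facts and one rule; a hypothetical stand-in, the original example may differ) could be:

human(sokrates).            % fact: sokrates is human
human(leibniz).             % fact: leibniz is human
greek(sokrates).            % fact: sokrates is greek
fallible(X) :- human(X).    % rule: if X is human, then X is fallible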
Definition 5.1.7. The knowledge base given by a ProLog program is the set of
facts that can be derived from it under the if/and reading above.
Definition 5.1.7 introduces a very important distinction: that between a ProLog program and the
knowledge base it induces. Whereas the former is a finite, syntactic object (essentially a string),
the latter may be an infinite set of facts, which represents the totality of knowledge about the
world or the aspects described by the program.
As knowledge bases can be infinite, we cannot precompute them. Instead, logic programming languages compute fragments of the knowledge base by need, i.e. whenever a user wants to check membership; we call this approach querying: the user enters a query term and the system answers yes or no. This answer is computed in a depth-first search process.
Definition 5.1.11. The ProLog interpreter keeps backchaining from the top to the bottom of the program until the query
succeeds, i.e. contains no more goals, or (answer: true)
fails, i.e. backchaining becomes impossible. (answer: false)
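For concreteness, the trace below presumably runs against the unary natural numbers program (a minimal sketch; nat/1 is assumed to be defined as follows):

nat(zero).                % zero is a natural number
nat(s(X)) :- nat(X).      % the successor of a natural number is a natural number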
?- nat(s(s(zero))).
?- nat(s(zero)).
?- nat(zero).
true
Note that backchaining replaces the current query with the body of the rule suitably instantiated.
For rules with a long body this extends the list of current goals, but for facts (rules without a
body), backchaining shortens the list of current goals. Once there are no goals left, the ProLog
interpreter finishes and signals success by issuing the string true.
If no rules match the current goal, then the interpreter terminates and signals failure with the
string false,
We can extend querying from simple yes/no answers to programs that return values by simply
using variables in queries. In this case, the ProLog interpreter returns a substitution.
In Example 5.1.15 the first backchaining step binds the variable X to the query variable Y, which gives us the two subgoals has_wheels(Y,4),has_motor(Y), which again contain the query variable Y.
The next backchaining step binds this to mybmw, and the third backchaining step exhausts the
subgoals. So the query succeeds with the (overall) answer substitution Y = mybmw. With this
setup, we can already do the “fallible Greeks” example from the introduction.
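For reference, a sketch of the program behind Example 5.1.15, as it can be reconstructed from the discussion above (the example reappears, extended, in Example 5.2.2 below):

has_wheels(mybmw,4).
has_motor(mybmw).
car(X) :- has_wheels(X,4), has_motor(X).

% querying with a variable returns an answer substitution:
% ?- car(Y).
% Y = mybmw.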
In this section, we want to really use ProLog as a programming language, so let us first get our tools set up.
Video Nuggets covering this section can be found at https://fau.tv/clip/id/21754 and
https://fau.tv/clip/id/21827.
?- unat(suc(suc(zero))).
Even though we can use any text editor to program ProLog, running ProLog in a modern editor with language support is incredibly nicer than at the command line, because you can see the whole history of what you have done. It’s better for debugging too. We will use emacs as an example in the following.
If you’ve never used emacs before, it still might be nicer, since it’s pretty easy to get used to the little bit of emacs that you need. (Just type “emacs &” at the UNIX command line to run it; if you are on a remote terminal without visual capabilities, you can use “emacs -nw”.)
If you don’t already have a file in your home directory called “.emacs” (note the dot at the front),
create one and put the following lines in it. Otherwise add the following to your existing .emacs
file:
(autoload 'run-prolog "prolog" "Start a Prolog sub-process." t)
(autoload 'prolog-mode "prolog" "Major mode for editing Prolog programs." t)
(setq prolog-program-name "swipl") ; or whatever the prolog executable name is
(add-to-list 'auto-mode-alist '("\\.pl$" . prolog-mode))
The file prolog.el, which provides prolog-mode, should already be installed on your machine; otherwise download it at http://turing.ubishops.ca/home/bruda/emacs-prolog/
Now, once you’re in emacs, you will need to figure out what your “meta” key is. Usually it’s the alt key. (Type the “control” key together with “h” to get help on using emacs.) So you’ll need a “meta-x” command: press the meta key, type “x”, and a little window will appear at the bottom of your emacs window with “M-x”, where you type run-prolog³.
This will start up the SWI ProLog interpreter, . . . et voilà!
The best thing is you can have two windows “within” your emacs window, one where you’re editing your program and one where you’re running ProLog. This makes debugging easier.
So far, all the examples led to direct success or to failure. (simpl. KB)
Definition 5.2.1 (ProLog Search Procedure). The ProLog interpreter employs top-down, left-to-right depth first search; concretely, it:
works on the subgoals in left-to-right order,
matches the first query with the head literals of the clauses in the program in top-down order,
if there are no matches, fails and backtracks to the (chronologically) last backtrack point,
otherwise backchains on the first match, keeping the other matches in mind for backtracking via backtrack points.
We can force backtracking to get more solutions by typing ;.
Note:
With the ProLog search procedure detailed above, computation can easily go into infinite loops,
even though the knowledge base could provide the correct answer. Consider for instance the simple
program
p(X) :- p(X).
p(X) :- q(X).
q(X).
3 Type “control” key together with “h” then press “m” to get an exhaustive mode help.
If we query this with ?- p(john), then DFS will go into an infinite loop because ProLog by default expands via the first clause. However, we could conclude that p(john) is true if we expanded via the second clause instead.
In fact this is a necessary feature and not a bug for a programming language: we need to
be able to write non-terminating programs, since the language would not be Turing complete
otherwise. The argument can be sketched as follows: we have seen that for Turing machines the
halting problem is undecidable. So if all ProLog programs were terminating, then ProLog would
be weaker than Turing machines and thus not Turing complete.
We will now fortify our intuition about the ProLog search procedure by an example that extends
the setup from Example 5.1.15 by a new choice of a vehicle that could be a car (if it had a motor).
Backtracking by Example
Example 5.2.2. We extend Example 5.1.15:
has_wheels(mytricycle,3).
has_wheels(myrollerblade,3).
has_wheels(mybmw,4).
has_motor(mybmw).
car(X):-has_wheels(X,3),has_motor(X). % cars sometimes have three wheels
car(X):-has_wheels(X,4),has_motor(X). % and sometimes four.
?- car(Y).
?- has_wheels(Y,3),has_motor(Y). % backtrack point 1
Y = mytricycle % backtrack point 2
?- has_motor(mytricycle).
FAIL % fails, backtrack to 2
Y = myrollerblade % backtrack point 2
?- has_motor(myrollerblade).
FAIL % fails, backtrack to 1
?- has_wheels(Y,4),has_motor(Y).
Y = mybmw
?- has_motor(mybmw).
Y=mybmw
true
In general, a ProLog rule of the form A:−B,C reads as A, if B and C. If we want to express A if
B or C, we have to express this two separate rules A:−B and A:−C and leave the choice which
one to use to the search procedure.
In Example 5.2.2 we indeed have two clauses for the predicate car/1; one each for the cases of cars
with three and four wheels. As the three-wheel case comes first in the program, it is explored first
in the search process.
Recall that at every point where the ProLog interpreter has the choice between two clauses for a predicate, it chooses the first and leaves a backtrack point. In Example 5.2.2 this happens first
for the predicate car/1, where we explore the case of three-wheeled cars. The ProLog interpreter
immediately has to choose again – between the tricycle and the rollerblade, which both have three
wheels. Again, it chooses the first and leaves a backtrack point. But as tricycles do not have mo-
tors, the subgoal has_motor(mytricycle) fails and the interpreter backtracks to the chronologically
nearest backtrack point (the second one) and tries to fulfill has_motor(myrollerblade). This fails
again, and the next backtrack point is point 1 – note the stack-like organization of backtrack points
which is in keeping with the depth-first search strategy – which chooses the case of four-wheeled cars. This ultimately succeeds as before with Y = mybmw.
We now turn to a more classical programming task: computing with numbers. Here we return to our initial example: adding unary natural numbers. If we can do that, then we can also consider multiplication and exponentiation.
expt(X,zero,s(zero)).
expt(X,s(Y),Z) :- expt(X,Y,W), mult(X,W,Z).
Note: Viewed through the right glasses, logic programming is very similar to functional programming; the only difference is that we are using (n+1)-ary relations rather than n-ary functions. To see how this works, let us consider the addition function/relation example above: instead of a binary function + we program a ternary relation add, where the relation add(X,Y,Z) means X + Y = Z. We start with the same defining equations for addition, rewriting them to relational style.
The first equation is straight-forward via our correspondence and we get the ProLog fact
add(X,zero,X). For the equation X + s(Y ) = s(X + Y ) we have to work harder, the straight-
forward relational translation add(X,s(Y),s(X+Y)) is impossible, since we have only partially
replaced the function + with the relation add. Here we take refuge in a very simple trick that we
can always do in logic (and mathematics of course): we introduce a new name Z for the offending
expression X + Y (using a variable) so that we get the fact add(X,s(Y ),s(Z)). Of course this is
not universally true (remember that this fact would say that “X + s(Y ) = s(Z) for all X, Y , and
Z”), so we have to extend it to a ProLog rule add(X,s(Y),s(Z)):−add(X,Y,Z). which relativizes to
mean “X + s(Y ) = s(Z) for all X, Y , and Z with X + Y = Z”.
Indeed the rule implements addition as a recursive predicate; we can see that the recursion relation is terminating, since the left hand sides have one more constructor for the successor function. The examples for multiplication and exponentiation can be developed analogously, but we have to use the naming trick twice.
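Putting the pieces together (a sketch following the recipe just described; the notes only spell out the addition case in prose), the addition and multiplication predicates could read:

add(X,zero,X).
add(X,s(Y),s(Z)) :- add(X,Y,Z).
% X * 0 = 0  and  X * s(Y) = (X * Y) + X; the intermediate result X * Y is named W
mult(_,zero,zero).
mult(X,s(Y),Z) :- mult(X,Y,W), add(W,X,Z).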
We now apply the same principle of recursive programming with predicates to other examples
to reinforce our intuitions about the principles.
?- add(s(zero),X,s(s(s(zero)))).
X = s(s(zero))
true
Example 5.2.5.
Computing the nth Fibonacci number (0, 1, 1, 2, 3, 5, 8, 13,. . . ; add the last two
to get the next), using the addition predicate above.
fib(zero,zero).
fib(s(zero),s(zero)).
fib(s(s(X)),Y) :- fib(s(X),Z), fib(X,W), add(Z,W,Y).
Note: The is relation does not allow “generate and test” inversion, as it insists on its right hand side being ground. In our example above, this is not a problem if we call fib with the first (“input”) argument a ground term. Indeed, if we match the last rule of an arithmetic variant of fib (sketched below) with a goal ?- fib(g,Y), where g is a ground term, then g-1 and g-2 are ground and thus D and E are bound to the (ground) result terms. This makes the input arguments in the two recursive calls ground, and we get ground results for Z and W, which allows the last goal to succeed with a ground result for Y. Note as well that re-ordering the body literals of the rule so that the recursive calls come before the computation literals will lead to failure.
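The arithmetic variant alluded to in the remark (a sketch using ProLog integers and the built-in is/2; the variable names D and E are taken from the discussion above) could look like this:

fib(0,0).
fib(1,1).
fib(X,Y) :- X > 1, D is X - 1, E is X - 2, fib(D,Z), fib(E,W), Y is Z + W.

Here D and E are computed (and thus ground) before the recursive calls; moving the recursive fib/2 goals in front of the is/2 goals would not work, since is/2 insists on a ground right hand side.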
We will now add the primitive data structure of lists to ProLog; they are constructed by prepending
an element (the head) to an existing list (which becomes the rest list or “tail” of the constructed
one).
append([],L,L).
append([X|R],L,[X|S]) :- append(R,L,S).
reverse([],[]).
reverse([X|R],L) :- reverse(R,S), append(S,[X],L).
Just as in functional programming languages, we can define list operations by recursion, only that
we program with relations instead of with functions.
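For instance (a usage sketch, not from the original notes), we can query these predicates as follows; note that append/3 can even be run “backwards” to split a list, another instance of the missing input/output distinction discussed below:

?- append([1,2],[3,4],L).
L = [1,2,3,4]
?- reverse([1,2,3],R).
R = [3,2,1]
?- append(X,Y,[1,2]).
X = [], Y = [1,2] ;
X = [1], Y = [2] ;
X = [1,2], Y = []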
Logic programming is the third large programming paradigm (together with functional program-
ming and imperative programming).
Example 5.2.9.
Generate and test:
sort(Xs,Ys) :- perm(Xs,Ys), ordered(Ys).
From a programming practice point of view it is probably best understood as “relational program-
ming” in analogy to functional programming, with which it shares a focus on recursion.
The major difference to functional programming is that “relational programming” does not have a fixed input/output distinction, which in functional programs is what makes the control flow very direct and predictable. Thanks to the underlying search procedure, we can sometimes make use of the flexibility afforded by logic programming.
If the problem solution involves search (and depth first search is sufficient), we can just get by with specifying the problem and letting the ProLog interpreter do the rest. In Example 5.2.9 we just specify that list Xs can be sorted into Ys, iff Ys is a permutation of Xs and Ys is ordered. Given a concrete (input) list Xs, the ProLog interpreter will generate all permutations Ys of Xs via the predicate perm/2 and then test whether they are ordered; a possible definition of these helper predicates is sketched below.
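A possible definition of the two helper predicates (a sketch; perm/2 and ordered/1 are not spelled out in the example, and select/3 is the standard library predicate that picks an element from a list):

perm([],[]).
perm(L,[X|R]) :- select(X,L,S), perm(S,R).
ordered([]).
ordered([_]).
ordered([X,Y|R]) :- X =< Y, ordered([Y|R]).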
This is a paradigmatic example of logic programming. We can (sometimes) directly use the specification of a problem as a program. This makes the argument for the correctness of the program immediate, but may make the program execution non-optimal.
It is easy to see that the running time of the ProLog program from Example 5.2.9 is not O(n log2(n)), which is optimal for sorting algorithms. This is the flip side of the flexibility in logic programming. But ProLog has ways of dealing with that: the cut operator, a ProLog atom which always succeeds, but which cannot be backtracked over. This can be used to prune the search tree in ProLog. We will not go into that here but refer the readers to the literature.
We “define” the computational behavior of the predicate rev, but the list constructors
[. . .] are just used to construct lists from arguments.
Example 5.2.14 (Trees and Leaf Counting). We represent (unlabelled) trees via the function t from tree lists to trees. For instance, a balanced binary tree of depth 2 is t([t([t([]),t([])]),t([t([]),t([])])]). We count leaves by
leafcount(t([]),1).
leafcount(t([X]),Y) :- leafcount(X,Y).
leafcount(t([X1,X2|R]),Y) :- leafcount(X1,Z), leafcount(t([X2|R]),W), Y is Z + W.
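For the balanced binary tree of depth 2 given above, this yields (a usage sketch):

?- leafcount(t([t([t([]),t([])]),t([t([]),t([])])]),N).
N = 4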
RTFM (≙ “read the fine manuals”)
RTFM Resources: There are also lots of good tutorials on the web;
I personally like [Fis; LPN],
[Fla94] has a very thorough logic-based introduction,
consult also the SWI Prolog Manual [SWI].
In this chapter we will briefly recap some of the prerequisites from theoretical computer science
that are needed for understanding Artificial Intelligence 1.
We now come to an important topic which is not really part of Artificial Intelligence but which adds an important layer of understanding to this enterprise: We (still) live in the era of Moore’s law (the computing power available on a single CPU doubles roughly every two years), leading to an exponential increase. A similar rule holds for main memory and disk storage capacities. And the production of computers (using CPUs and memory) is (still) growing very rapidly as well, giving mankind as a whole, institutions, and individuals exponentially growing computational resources.
In public discussion, this development is often cited as the reason why (strong) AI is inevitable. But the argument is fallacious if all the algorithms we have are of very high complexity (i.e. at least exponential in either time or space). So, to judge the state of play in Artificial Intelligence, we have to know the complexity of our algorithms.
In this section, we will give a very brief recap of some aspects of elementary complexity theory and make a case for why this is generally important for computer scientists.
A Video Nugget covering this section can be found at https://fau.tv/clip/id/21839 and
https://fau.tv/clip/id/21840.
In order to get a feeling for what we mean by “fast algorithm”, we do some preliminary computations.
performance (time needed as a function of problem size):

size n    | linear (100n µs) | quadratic (7n^2 µs) | exponential (2^n µs)
1         | 100 µs           | 7 µs                | 2 µs
5         | .5 ms            | 175 µs              | 32 µs
10        | 1 ms             | .7 ms               | 1 ms
45        | 4.5 ms           | 14 ms               | 1.1 Y
100       | 100 ms           | 7 s                 | 10^16 Y
1 000     | 1 s              | 12 min              | −
10 000    | 10 s             | 20 h                | −
1 000 000 | 1.6 min          | 2.5 mon             | −
So it does make a difference for larger problems what algorithm we choose. Considerations like
the one we have shown above are very important when judging an algorithm. These evaluations
go by the name of “complexity theory”.
Let us now recapitulate some notions of elementary complexity theory: we are interested in the worst case growth of the resources (time and space) required by an algorithm in terms of the sizes of its arguments. Mathematically we look at the functions from input size to resource size and classify them into “big-O” classes, abstracting from constant factors (which depend on the machine the algorithm runs on and which we cannot control) and initial (algorithm startup) factors.
Definition: Let α be an algorithm that terminates in time t(n) on inputs of size n; then we call t the worst case running time of α and write T(α) := t.
We say that α has time complexity in S (written T(α) ∈ S or colloquially T(α) = S), iff t ∈ S. We say α has space complexity in S, iff α uses only memory of size s(n) on inputs of size n and s ∈ S.
Landau set | class name  | rank        Landau set | class name  | rank
O(1)       | constant    | 1           O(n^2)     | quadratic   | 4
O(ln(n))   | logarithmic | 2           O(n^k)     | polynomial  | 5
O(n)       | linear      | 3           O(k^n)     | exponential | 6
For AI-2: I expect that, given an algorithm, you can determine its complexity class. (next)
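For instance (a worked example, not from the original slides): the reverse/2 predicate from the ProLog chapter calls append/3 once per element of its input, and append/3 is itself linear in the length of its first argument, so reversing a list of length n takes on the order of 1 + 2 + · · · + n = n(n + 1)/2 steps; hence reverse/2 has time complexity in O(n^2), i.e. it is quadratic.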
OK, that was the theory, . . . but how do we use that in practice?
The time complexity T (α) is just T∅ (α), where ∅ is the empty function.
Recursion is much more difficult to analyze ; recurrence relations and Master’s
theorem.
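For reference (the standard statement, not spelled out in the notes): for recurrences of the form T(n) = a·T(n/b) + f(n) with a ≥ 1 and b > 1, the Master theorem gives
T(n) ∈ Θ(n^(log_b a))           if f(n) ∈ O(n^(log_b a − ε)) for some ε > 0,
T(n) ∈ Θ(n^(log_b a) · log n)   if f(n) ∈ Θ(n^(log_b a)),
T(n) ∈ Θ(f(n))                  if f(n) ∈ Ω(n^(log_b a + ε)) for some ε > 0 and a·f(n/b) ≤ c·f(n) for some c < 1 and sufficiently large n.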
Please excuse the chemistry pictures, public imagery for CS is really just quite boring, this is what
people think of when they say “scientist”. So, imagine that instead of a chemist in a lab, it’s me
sitting in front of a computer.
But my 2nd attempt didn’t work either, which got me a bit agitated.
Ta-da . . . when, for once, I turned around and looked in the other direction–
CAN one actually solve this efficiently? – NP hardness was there to rescue me.
Example 6.1.4. Trying to find a sea route east to India (from Spain) (does not
exist)
Observation: Complexity theory saves you from spending lots of time trying to
invent algorithms that do not exist.
It’s like, you’re trying to find a route to India (from Spain), and you presume it’s somewhere to
the east, and then you hit a coast, but no; try again, but no; try again, but no; ... if you don’t
have a map, that’s the best you can do. But NP hardness gives you the map: you can check that
there actually is no way through here.
But what is this notion of NP completeness alluded to above? We observe that we can analyze the complexity of computational problems by the complexity of the algorithms that solve them. This gives us a notion of what to expect from solutions to a given problem class, and thus whether efficient (i.e. polynomial time) algorithms can exist at all.
Assume: In 3 years from now, you have finished your studies and are working in
your first industry job. Your boss Mr. X gives you a problem and says Solve It!. By
which he means, write a program that solves it efficiently.
Question: Assume further that, after trying in vain for 4 weeks, you got the next
meeting with Mr. X. How could knowing about NP hardness help?
Answer: reserved for the plenary sessions ; be there!
We have multiple notations for concatenation, since it is such a basic operation, which is used
so often that we will need very short notations for it, trusting that the reader can disambiguate
based on the context.
Now that we have defined the concept of a string as a sequence of characters, we can go on to give ourselves a way to distinguish between good strings (e.g. programs in a given programming language) and bad strings (e.g. those with syntax errors). The way to do this is by the concept of a formal language, which we are about to define.
Formal Languages
Definition 6.2.7. Let A be an alphabet, then we define the sets A+ := ⋃_{i∈N+} A^i of nonempty strings and A∗ := A+ ∪ {ϵ} of strings.
Example 6.2.8. If A = {a, b, c}, then A∗ = {ϵ, a, b, c, aa, ab, ac, ba, . . . , aaa, . . . }.
Definition 6.2.9. A set L ⊆ A∗ is called a formal language over A.
Definition 6.2.10.
We use c[n] for the string that consists of the character c repeated n times.
Example 6.2.11.
#[5] = ⟨#, #, #, #, #⟩
Example 6.2.12.
The set M := {ba[n] | n∈N} of strings that start with character b followed by an arbitrary number of a’s is a formal language over A = {a, b}.
Definition 6.2.13 (Operations on Languages). Let L, L1, and L2 be formal languages over the same alphabet, then we define the language level operations:
L1L2 := {s1s2 | s1∈L1 ∧ s2∈L2}, L+ := ⋃_{i∈N+} L^i, and L∗ := L+ ∪ {ϵ}.
There is a common misconception that a formal language is something that is difficult to understand as a concept. This is not true; the only thing a formal language does is separate the “good” from the “bad” strings. Thus we simply model a formal language as a set of strings: the “good” strings are members, and the “bad” ones are not.
Of course this definition only shifts complexity to the way we construct specific formal languages
(where it actually belongs), and we have learned two (simple) ways of constructing them: by
repetition of characters, and by concatenation of existing languages.
As mentioned above, the purpose of a formal language is to distinguish “good” from “bad”
strings. It is maximally general, but not helpful, since it does not support computation and
inference. In practice we will be interested in formal languages that have some structure, so that
we can represent formal languages in a finite manner (recall that a formal language is a subset of
A∗ , which may be infinite and even undecidable – even though the alphabet A is finite).
To remedy this, we will now introduce phrase structure grammars (or just grammars), the
standard tool for describing structured formal languages.
Definition 6.2.14.
A phrase structure grammar (or just grammar) is a tuple ⟨N , Σ, P , S ⟩ where
N is a finite set of nonterminal symbols,
Σ is a finite set of terminal symbols, members of Σ ∪ N are called symbols.
P is a finite set of grammar rules: pairs p := h→b, where h ∈ (Σ ∪ N)∗ N (Σ ∪ N)∗ and b ∈ (Σ ∪ N)∗. The string h is called the head of p and b the body.
S ∈ N is a distinguished symbol called the start symbol (also sentence symbol).
Intuition: Grammar rules map strings with at least one nonterminal to arbitrary
other strings.
Notation:
If we have n rules h→bi sharing a head, we often write h→b1 | . . . | bn instead.
We fortify our intuition about these – admittedly very abstract – constructions by an example
and introduce some more vocabulary.
S → NP ; Vi
NP → Article; N
Article → the | a | an
N → dog | teacher | . . .
Vi → sleeps | smells | . . .
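As an aside (a sketch, not part of the original notes): in ProLog such a context-free grammar can be written down almost verbatim as a definite clause grammar (DCG) and then used for parsing directly:

s --> np, vi.
np --> article, n.
article --> [the] ; [a] ; [an].
n --> [dog] ; [teacher].
vi --> [sleeps] ; [smells].

% ?- phrase(s, [the, teacher, sleeps]).   succeeds
% ?- phrase(s, [the, dog, teacher]).      fails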
Definition 6.2.16. The subset of lexical rules, i.e. those whose body consists of a
single terminal is called its lexicon and the set of body symbols the alphabet. The
nonterminals in their heads are called lexical categories.
Definition 6.2.17. The non-lexicon grammar rules are called structural, and the
nonterminals in the heads are called phrasal categories.
Now we look at just how a grammar helps in analyzing formal languages. The basic idea is
that a grammar accepts a word, iff the start symbol can be rewritten into it using only the rules
of the grammar.
Definition: A grammar G = ⟨N, Σ, P, S⟩ derives t ∈ (Σ ∪ N)∗ from s ∈ (Σ ∪ N)∗ in one step, iff there is a grammar rule p∈P with p = h→b and there are u, v ∈ (Σ ∪ N)∗, such that s = uhv and t = ubv. We write s →pG t (or s →G t if p is clear from the context) and use →∗G for the reflexive transitive closure of →G. We call s →∗G t a G-derivation of t from s.
Example (using the grammar above):
S →G NP; Vi
  →G Article; N; Vi
  →G Article; teacher; Vi
  →G the; teacher; Vi
  →G the; teacher; sleeps
Thus S →∗G the; teacher; sleeps, i.e. “The teacher sleeps” is a sentence.
Note that this process indeed defines a formal language given a grammar, but does not provide
an efficient algorithm for parsing, even for the simpler kinds of grammars we introduce below.
S → a; b; c | A
A → a; A; B ; c | a; b; c
c; B → B; c
b; B → b; b
alternative: s1 | . . . | sn ,
repetition: s∗ (arbitrary many s) and s+ (at least one s),
optional: [s] (zero or one times), and
grouping: (s1 ; . . . ; sn ), useful e.g. for repetition.
Example 6.2.26.
S ::= a; S; a
S ::= b; S; b
S ::= c
S ::= d
close to the official name of the syntactic category (for the use in the head)
In AI-2 we will only use context-free grammars (simpler, but problem still applies)
in AI-2: I will try to give “grammar overviews” that combine those, e.g. the
grammar of first-order logic.
variables    X ∈ V1
functions    f^k ∈ Σ^f_k
predicates   p^k ∈ Σ^p_k
terms        t ::= X                      variable
                |  f^0                    constant
                |  f^k(t_1, . . ., t_k)   application
formulae     A ::= p^k(t_1, . . ., t_k)   atomic
                |  ¬A                     negation
                |  A_1 ∧ A_2              conjunction
                |  ∀X A                   quantifier
We will generally get by with context-free grammars, which have highly efficient parsing algorithms, for the formal languages we use in this course, but we will not cover the algorithms in AI-2.
Mathematical Structures
Observation: Mathematicians often cast object classes as mathematical struc-
tures.
We have just seen this: repeated here for convenience.
Definition 6.3.1.
A phrase structure grammar (or just grammar) is a tuple ⟨N , Σ, P , S ⟩ where
Observation: Even though we call grammar rules “pairs” above, they are also
mathematical structures ⟨h, b⟩ with a funny notation h→b.
Most programming languages have some way of creating “named structures”. Ref-
erencing components is usually done via “dot notation”
Example 6.3.2 (Structs in C).
// Create strutures grule grammar
struct grule {
char[][] head;
char[][] body;
}
struct grammar {
char[][] nterminals;
char[][] termininals;
grule[] grules;
char[] start;
}
int main() {
struct grule r1;
r1.head = "foo";
r1.body = "bar";
}
I will try to always give “structure overviews”, that combine notations with “type”
information and accessor names, e.g.
In this chapter, we introduce a framework that gives a comprehensive conceptual model for the
multitude of methods and algorithms we cover in this course. The framework of rational agents
accommodates two traditions of AI.
Initially, the focus of AI research was on symbolic methods concentrating on the mental processes
of problem solving, starting from Newell/Simon’s “physical symbol hypothesis”:
A physical symbol system has the necessary and sufficient means for general intelligent action.
[NS76]
Here a symbol is a representation of an idea, object, or relationship that is physically manifested in (the brain of) an intelligent agent (human or artificial).
Later – in the 1980s – the proponents of embodied AI posited that most features of cognition,
whether human or otherwise, are shaped – or at least critically influenced – by aspects of the
entire body of the organism. The aspects of the body include the motor system, the perceptual
system, bodily interactions with the environment (situatedness) and the assumptions about the
world that are built into the structure of the organism. They argue that symbols are not always
necessary since
The world is its own best model. It is always exactly up to date. It always has every detail
there is to be known. The trick is to sense it appropriately and often enough. [Bro90]
The framework of rational agents – initially introduced by Russell and Wefald in [RW91] – accommodates both: it situates agents with percepts and actions in an environment, but does not preclude physical symbol systems – i.e. systems that manipulate symbols – as agent functions. Russell and Norvig make it the central metaphor of their book “Artificial Intelligence – A modern approach” [RN03], which we follow in this course.
Thinking humanly: “The exciting new effort to make computers think . . . machines with human-like minds” [Hau85]
Thinking rationally: “The formalization of mental faculties in terms of computational models” [CM85]
Acting humanly: “The art of creating machines that perform actions requiring intelligence when performed by people” [Kur90]
Acting rationally: “The branch of CS concerned with the automation of appropriate behavior in complex situations” [LS93]
We now discuss all of the four facets in a bit more detail, as they all either contribute directly
to our discussion of AI methods or characterize neighboring disciplines.
Note: In [Tur50], Alan Turing predicted that by 2000 a machine might have a 30% chance of fooling a lay person for 5 minutes.
Acting Rationally
Idea: Rational behavior ≙ doing the right thing!
Definition 7.1.4. Rational behavior consists of always doing what is expected to
maximize goal achievement given the available information.
Rational behavior does not necessarily involve thinking (e.g., the blinking reflex) – but thinking should be in the service of rational action.
Aristotle: Every art and every inquiry, and similarly every action and pursuit, is
thought to aim at some good. (Nicomachean Ethics)
Central Idea: This course is about designing agents that exhibit rational behavior, i.e. for any given class of environments and tasks, we seek the agent (or class of agents) with the best performance.
We assume that agents can always perceive their own actions. (but not necessarily
their consequences)
Problem: agent functions can become very big (theoretical tool only)
Definition 7.2.6. An agent function can be implemented by an agent program
that runs on a physical agent architecture.
Figure 2.1 (AIMA): Agents interact with environments through sensors and actuators. Different agents differ on the contents of the white box in the center.
Example: Vacuum-Cleaner World and Agent
percepts: location and contents, e.g., [A, Dirty]
actions: Left, Right, Suck, NoOp
A partial tabulation of the agent function:

Percept sequence                          Action
[A, Clean]                                Right
[A, Dirty]                                Suck
[B, Clean]                                Left
[B, Dirty]                                Suck
[A, Clean], [A, Clean]                    Right
[A, Clean], [A, Dirty]                    Suck
[A, Clean], [B, Clean]                    Left
[A, Clean], [B, Dirty]                    Suck
[A, Dirty], [A, Clean]                    Right
[A, Dirty], [A, Dirty]                    Suck
. . .                                     . . .
[A, Clean], [A, Clean], [A, Clean]        Right
[A, Clean], [A, Clean], [A, Dirty]        Suck
. . .                                     . . .

Science Question: What is the right agent function?
AI Question: Is there an agent architecture and an agent program that implements it?

[The slide reproduces the corresponding passage from AIMA:]
[. . . ] described by the agent function that maps any given percept sequence to an action.
We can imagine tabulating the agent function that describes any given agent; for most agents, this would be a very large table—infinite, in fact, unless we place a bound on the length of percept sequences we want to consider. Given an agent to experiment with, we can, in principle, construct this table by trying out all possible percept sequences and recording which actions the agent does in response.¹ The table is, of course, an external characterization of the agent. Internally, the agent function for an artificial agent will be implemented by an agent program. It is important to keep these two ideas distinct. The agent function is an abstract mathematical description; the agent program is a concrete implementation, running within some physical system.
To illustrate these ideas, we use a very simple example—the vacuum-cleaner world shown in Figure 2.2. This world is so simple that we can describe everything that happens; it’s also a made-up world, so we can invent many variations. This particular world has just two locations: squares A and B. The vacuum agent perceives which square it is in and whether there is dirt in the square. It can choose to move left, move right, suck up the dirt, or do nothing. One very simple agent function is the following: if the current square is dirty, then suck; otherwise, move to the other square. A partial tabulation of this agent function is shown in Figure 2.3 and an agent program that implements it appears in Figure 2.8 on page 48.
Looking at Figure 2.3, we see that various vacuum-world agents can be defined simply by filling in the right-hand column in various ways. The obvious question, then, is this: What is the right way to fill out the table? In other words, what makes an agent good or bad, intelligent or stupid? We answer these questions in the next section.
¹ If the agent uses some randomization to choose its actions, then we would have to try each sequence many times to identify the probability of each action. One might imagine that acting randomly is rather silly, but we show later in this chapter that it can be very intelligent.
Example 7.2.7 (Agent Program).
Table-Driven Agents
Idea: We can just implement the agent function as a table and look up actions.
The table is much too large: even with n binary percepts whose order of occurrence does not matter, we have 2^n rows in the table.
Who is supposed to write this table anyways, even if it “only” has a million
entries?
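To make the idea concrete (a hypothetical ProLog sketch, not from the original notes; in reality the table would be unmanageably large), a table-driven agent for the vacuum cleaner world could store the percept-sequence/action table as facts and act by pure lookup:

% one fact per row of the (truncated) agent function table
action([[a,clean]], right).
action([[a,dirty]], suck).
action([[b,clean]], left).
action([[b,dirty]], suck).
action([[a,clean],[a,clean]], right).
action([[a,clean],[a,dirty]], suck).
% ... one fact for every possible percept sequence ...

% the agent program is a mere table lookup on the full percept history
table_driven_agent(PerceptSequence, Action) :- action(PerceptSequence, Action).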
Rationality
Idea: Try to design agents that are successful! (aka. “do the right thing”)
Definition 7.3.1. A performance measure is a function that evaluates a sequence
of environments.
Example 7.3.2. A performance measure for the vacuum cleaner world could
Definition 7.3.4. An agent is called autonomous, if it does not rely on the prior knowledge of its designer about the environment.
Autonomy avoids fixed behaviors that can become unsuccessful in a changing environment. (anything else would be irrational)
The agent has to learn all relevant traits, invariants, and properties of the environment and its actions. (such an agent is called a learning agent)
Agents
Which are agents?
(A) James Bond.
(B) Your dog.
(C) Vacuum cleaner.
(D) Thermometer.
Answer: reserved for the plenary sessions ; be there!
Environment types
Observation 7.4.1. Agent design is largely determined by the type of environment
it is intended for.
Problem:
There is a vast number of possible kinds of environments in AI.
Observation 7.4.4.
The real world is (of course) a partially observable, stochastic, sequential, dynamic,
continuous, and multi agent environment. (worst case for AI)
In the AI-2 course we will work our way from the simpler environment types to the more general ones. Each environment type will need its own agent types specialized to surviving and doing well in them.
We will now discuss the main types of agents we will encounter in this course, get an impression
of the variety, and what they can and cannot do. We will start from simple reflex agents, add
state, and utility, and finally add learning. A Video Nugget covering this section can be found
at https://fau.tv/clip/id/21926.
Agent types
Observation: So far we have described (and analyzed) agents only by their behavior (cf. agent function f : P∗ → A).
Problem:
This does not help us to build agents. (the goal of AI)
To build an agent, we need to fix an agent architecture and come up with an agent
program that runs on it.
Preview: Four basic types of agent architectures in order of increasing generality:
1. simple reflex agents
2. model based agents
3. goal based agents
4. utility based agents
All these can be turned into learning agents.
Figure 2.10 (AIMA): A simple reflex agent. It acts according to a rule whose condition matches the current state, as defined by the percept.
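In contrast to the table-driven agent above, a simple reflex agent for the vacuum cleaner world needs only a handful of condition-action rules that look at the current percept (a hypothetical ProLog sketch, not from the original notes):

% condition-action rules over the current percept [Location, Status] only
rule([_, dirty], suck).
rule([a, clean], right).
rule([b, clean], left).

reflex_vacuum_agent(Percept, Action) :- rule(Percept, Action).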
Problem: Simple reflex agents can only react to the perceived state of the envi-
ronment, not to changes.
Example 7.5.3. Automobile tail lights signal braking by brightening. A simple reflex agent would have to compare subsequent percepts to realize this.
Problem: Partially observable environments get simple reflex agents into trouble.
Example 7.5.4. Vacuum cleaner robot with defective location sensor ; infinite
loops.
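As an illustration (not from the original notes), here is a minimal simple reflex agent for the two-square vacuum world; the location names and the rule set are assumptions for the sketch.

  # A simple reflex agent for the two-square vacuum world (illustrative sketch).
  # It maps the current percept directly to an action via condition-action rules;
  # it has no memory, so it cannot react to *changes* between percepts.

  def simple_reflex_vacuum_agent(percept):
      location, status = percept          # e.g. ("A", "Dirty")
      if status == "Dirty":
          return "Suck"
      elif location == "A":
          return "Right"
      elif location == "B":
          return "Left"

  # Example: the agent reacts only to the current percept.
  print(simple_reflex_vacuum_agent(("A", "Dirty")))   # -> Suck
  print(simple_reflex_vacuum_agent(("A", "Clean")))   # -> Right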
Model-based Reflex Agents
Agent Schema: (figure omitted: a model-based reflex agent that keeps an internal state, updated via “how the world evolves” and “what my actions do”.)
(optionally) a transition model T, that predicts a new state s′′ from a state s′ and an action a.
An action function f that maps (new) states to actions.
The agent function is iteratively computed via e ↦ f(S(s, e)).
Note: Since different percept sequences lead to different states, the agent function f_a : P* → A no longer depends only on the last percept.
Example 7.5.6 (Tail Lights Again). Model based agents can handle the tail light example from Example 7.5.3 if the states include a concept of tail light brightness.
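A minimal Python sketch (not from the notes; the state update and threshold are made up) of how such an agent program keeps and uses an internal state for the tail light example:

  # A model-based reflex agent for the tail light example (illustrative sketch).
  # The internal state remembers the previous percept, so the agent can react to
  # *changes* (brightening tail lights), which a simple reflex agent cannot.

  state = {"previous_brightness": None}   # internal world model

  def update_state(state, percept):
      """Hypothetical sensor model: record what we need from past percepts."""
      return {"previous_brightness": state.get("current_brightness"),
              "current_brightness": percept}

  def model_based_agent(percept):
      global state
      state = update_state(state, percept)
      prev, cur = state["previous_brightness"], state["current_brightness"]
      if prev is not None and cur > prev:   # tail lights brightened -> car ahead brakes
          return "Brake"
      return "KeepDriving"

  # Example: only the second percept (brighter lights) triggers braking.
  print(model_based_agent(0.3))   # -> KeepDriving
  print(model_based_agent(0.9))   # -> Brake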
Problem: Having a world model does not always determine what to do (ratio-
nally).
Example 7.5.8. Coming to an intersection, where the agent has to decide between
going left and right.
Goal-based agents
Problem:
A world model does not always determine what to do (rationally).
Observation: Having a goal in mind does! (determines future actions)
Agent Schema:
(Figure 2.13 from AIMA: A model-based, goal-based agent. It keeps track of the world state as well as a set of goals it is trying to achieve, and chooses an action that will (eventually) lead to the achievement of its goals.)
Goal-based agents (continued)
Definition 7.5.9. A goal based agent is a model based agent with transition model T that deliberates actions based on goals and a world model: It employs a set G of goals and a goal function f that, given a (new) state s′, selects an action a to best reach G.
The action function is then s ↦ f(T(s), G).
Observation: A goal based agent is more flexible in the knowledge it can utilize.
Example 7.5.10. A goal based agent can easily be changed to go to a new destination, while a model based agent’s rules make it go to exactly one destination.
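To make Definition 7.5.9 concrete, here is a minimal Python sketch (not part of the original notes); the transition model, goal sets, and place names are made-up illustrations.

  # A goal based agent sketch (illustrative; the route map and goals are made up).
  # The agent deliberates over a transition model T and a set of goals G and
  # prefers an action whose predicted successor state is a goal.

  transition_model = {          # T: which place an action leads to from a place
      ("Home", "go-A"): "A",
      ("Home", "go-B"): "B",
      ("A", "go-Dest"): "Destination",
      ("B", "go-Home"): "Home",
  }

  def goal_function(state, goals):
      """Pick an action whose predicted outcome is a goal, if one exists."""
      for (s, action), successor in transition_model.items():
          if s == state and successor in goals:
              return action
      # otherwise pick any applicable action (a real agent would search/plan here)
      for (s, action), successor in transition_model.items():
          if s == state:
              return action
      return "NoOp"

  # Changing the destination only means changing G, not rewriting the rules:
  print(goal_function("A", {"Destination"}))   # -> go-Dest
  print(goal_function("Home", {"B"}))          # -> go-B  (new goal, same rules)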
Utility-based agents
Definition 7.5.11. A utility based agent uses a world model along with a utility function that models its preferences among the states of that world. It chooses the action that leads to the best expected utility.
Agent Schema:
(Figure 2.14 from AIMA: A model-based, utility-based agent. It uses a model of the world, along with a utility function that measures its preferences among states of the world. Then it chooses the action that leads to the best expected utility, where expected utility is computed by averaging over all possible outcome states, weighted by the probability of the outcome.)
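A minimal Python sketch (not from the notes) of this expected-utility action selection; the stochastic transition model and utility values are made up.

  # Expected-utility action selection (illustrative sketch).
  # For each action we average the utilities of its possible outcome states,
  # weighted by their probabilities, and pick the action with the best expectation.

  # transition_model[action] = list of (probability, outcome_state)
  transition_model = {
      "left":  [(0.8, "safe"), (0.2, "ditch")],
      "right": [(0.5, "fast"), (0.5, "traffic_jam")],
  }
  utility = {"safe": 5.0, "ditch": -20.0, "fast": 10.0, "traffic_jam": -2.0}

  def expected_utility(action):
      return sum(p * utility[s] for p, s in transition_model[action])

  def choose_action():
      return max(transition_model, key=expected_utility)

  print({a: expected_utility(a) for a in transition_model})  # {'left': 0.0, 'right': 4.0}
  print(choose_action())                                     # -> right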
Learning Agents
Agent Schema:
(Figure omitted: a general learning agent, consisting of a performance element, a learning element that receives feedback from a critic (which judges behavior against a performance standard) and changes the performance element, and a problem generator that suggests exploratory actions.)
Solver specific to a particular problem (“domain”) vs. solver based on a description in a general problem-description language (e.g., the rules of any board game).
More efficient vs. much less design/maintenance work.
Next natural question: How do these work? (see the rest of the course)
Important Distinction: How the agent implements the world model.
Definition 7.6.1. We call a state representation atomic, iff it has no internal structure (states are just identifiers), factored, iff each state is given by a set of attribute value pairs, and structured, iff the state contains objects and their relationships.
Example 7.6.2. Consider the problem of finding a driving route from one end of
a country to the other via some sequence of cities.
In an atomic representation the state is represented by the name of a city.
In a factored representation we may have attributes “gps-location”, “gas”,. . .
(allows information sharing between states and uncertainty)
But how do we represent a situation where a large truck is blocking the road because it is trying to back into a driveway, but a loose cow is blocking its path? (an attribute
“TruckAheadBackingIntoDairyFarmDrivewayBlockedByLooseCow” is unlikely)
In a structured representation, we can have objects for trucks, cows, etc. and
their relationships.
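The following Python sketch (not from the notes; attribute names and objects are made up) contrasts the three kinds of representation for the driving example.

  # Atomic: the state is just a name, with no accessible internal structure.
  atomic_state = "Bucharest"

  # Factored: the state is a vector of attribute/value pairs; states can share
  # attributes, and some values may be unknown (uncertainty).
  factored_state = {"city": "Bucharest", "gps_location": (44.43, 26.10), "gas": 0.7}

  # Structured: the state contains objects and relations between them, so a
  # situation like "a truck backing into a driveway is blocked by a loose cow"
  # needs no monster attribute name.
  structured_state = {
      "objects":   ["truck1", "cow1", "driveway1", "road1"],
      "relations": [("blocks", "truck1", "road1"),
                    ("backing_into", "truck1", "driveway1"),
                    ("blocks", "cow1", "truck1")],
  }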
Summary
Agents interact with environments through actuators and sensors.
The agent function describes what the agent does in all circumstances.
The performance measure evaluates the environment sequence.
This part introduces search-based methods for general problem solving using atomic and fac-
tored representations of states.
Concretely, we discuss the basic techniques of search-based symbolic AI: first in the shape of classical, heuristic, and adversarial search, then in constraint propagation, where we see the first instances of inference-based methods.
Chapter 8
Problem Solving and Search
In this chapter, we will look at a class of algorithms called search algorithms. These are algorithms that help in quite general situations where there is a precisely described problem that needs to be solved. Hence the name “General Problem Solving” for the area.
We will use the following problem as a running example. It is simple enough to fit on one slide
and complex enough to show the relevant features of the problem solving algorithms we want to
talk about.
(Figure: the Romania road map from AIMA, with cities and route distances in km; this is our running example.)
it also limits the objectives by specifying goal states. (this excludes, e.g., staying another couple of weeks)
A solution is a sequence of actions that leads from the initial state to a goal state.
Problem solving computes solutions from problem formulations.
Finding the right level of abstraction and the required (not more!) information is
often the key to success.
Observation:
The formulation of problems from Definition 8.1.5 uses a (black-box) state representation. It
has enough functionality to construct the state space but nothing else. We will come back to this
in slide 117.
Remark 8.1.9. Note that search problems formalize problem formulations by making many of the
implicit constraints explicit.
search problem = ⟨ S (set of states), A (set of actions), T : A×S → P(S) (transition model), I (initial state), G (goal states) ⟩
We will now specialize Definition 8.1.5 to deterministic, fully observable environments, i.e. envi-
ronments where actions only have one – assured – outcome state.
Definition 8.1.11. The predicate that tests for goal states is called a goal test.
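As a minimal sketch (not from the notes), such a black-box search problem can be packaged as a small data structure; the vacuum-world state encoding used here is an assumption.

  # A deterministic search problem following the <S, A, T, I, G> formulation above
  # (illustrative sketch for the two-square vacuum world).

  from dataclasses import dataclass
  from typing import Callable, Iterable, Tuple

  State = Tuple[str, frozenset]          # (robot location, set of dirty rooms)

  @dataclass
  class SearchProblem:
      initial: State
      actions: Callable[[State], Iterable[str]]
      result: Callable[[State, str], State]     # deterministic transition model
      is_goal: Callable[[State], bool]          # the goal test
      step_cost: Callable[[State, str], float] = lambda s, a: 1.0

  def vacuum_actions(state):
      return ["left", "right", "suck"]

  def vacuum_result(state, action):
      loc, dirty = state
      if action == "suck":
          return (loc, dirty - {loc})
      return ("A" if action == "left" else "B", dirty)

  vacuum_problem = SearchProblem(
      initial=("A", frozenset({"A", "B"})),
      actions=vacuum_actions,
      result=vacuum_result,
      is_goal=lambda s: not s[1],               # goal test: no dirty rooms left
  )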
Observation 8.1.14. Declarative descriptions are strictly more powerful than black-
box descriptions: they induce blackbox descriptions, but also allow to analyze/sim-
plify the problem.
We will come back to this later ; planning.
Note that the definition of a search problem is very general; it applies to a great many real-world problems. So we will try to characterize these by difficulty. A Video Nugget covering this section can be found at https://fau.tv/clip/id/21928.
Problem types
Definition 8.2.1. A search problem is called a single state problem, iff it is
fully observable (at least the initial state)
deterministic (i.e. the successor of each state is determined)
static (states do not change other than by our own actions)
discrete (a countable number of states)
Definition 8.2.2. A search problem is called a multi state problem, iff it is like a single state problem except that the (initial) state is not fully observable, so that we have to reason about sets of possible states.
We will explain these problem types with another example. The problem P is very simple: We
have a vacuum cleaner and two rooms. The vacuum cleaner is in one room at a time. The floor
can be dirty or clean.
The possible states are determined by the position of the vacuum cleaner and the information whether each room is dirty or not. Obviously, there are eight states: S = {1, 2, 3, 4, 5, 6, 7, 8} for simplicity.
The goal is to have both rooms clean, and the vacuum cleaner can be anywhere. So the set G of goal states is {7, 8}. In the single-state version of the problem (start in state 5), [right, suck] is the shortest solution, but [suck, right, suck] is also one. In the multiple-state version we have, for example:
left → {3, 7}
suck → {7}
(Figure 3.3 from AIMA: The state space for the vacuum world. Links denote actions: L = Left, R = Right, S = Suck.)
Example: Vacuum-Cleaner World (continued)
Contingency Problem:
Murphy’s Law: suck can dirty a clean carpet.
Local sensing: dirt, location only.
Solution:
suck → {5, 7}
right → {6, 8}
(In general, a vacuum world with n locations has n · 2^n states.)
In the contingency version of P a solution is the following:
[suck → {5, 7}, right → {6, 8}, suck → {6, 8}]
etc. Of course, local sensing can help: narrow {6, 8} to {6} or {8}, if we are in the first, then
suck.
“Path cost”: There may be more than one solution and we might want to have the “best” one in
a certain sense.
“State”: e.g., we don’t care about tourist attractions found in the cities along the way. But this is
problem dependent. In a different problem it may well be appropriate to include such information
in the notion of state.
“Realizability”: one could also say that the abstraction must be sound wrt. reality.
Example: The 8-puzzle
(Figure: an 8-puzzle instance with a start state on the left and the goal state on the right.)
Example (Robotic assembly):
States: real-valued coordinates of robot joint angles and parts of the object to be assembled
Actions: continuous motions of robot joints
Goal test: assembly complete?
Path cost: time to execute
General Problems
Question: Which are “Problems”?
8.3 Search
A Video Nugget covering this section can be found at https://fau.tv/clip/id/21956.
(Figure 3.10 from AIMA: Nodes are the data structures from which the search tree is constructed. Each has a parent, a state, and various bookkeeping fields. Arrows point from child to parent.)
Observation: Paths in the search tree correspond to paths in the state space.
Definition 8.3.2. We define the path cost of a node n in a search tree T to be the sum of the step costs on the path from n to the root of T.
procedure Tree−Search (problem, strategy) /∗ returns a solution or failure ∗/
  fringe := insert(make_node(initial_state(problem)))
  loop
    if fringe <is empty> fail end if
    node := first(fringe,strategy)
    if NodeTest(State(node)) return State(node)
    else fringe := insert_all(expand(node,problem),strategy)
    end if
  end loop
end procedure
Definition 8.3.3. The fringe is a list of nodes that have not yet been considered in tree search. It is ordered by the strategy. (see below)
• Expand applies all operators of the problem to the current node and yields a set of new nodes.
• Insert inserts an element into the current fringe queue. This can change the behavior of the search.
• Insert-All performs Insert on a set of elements.
Search strategies
Definition 8.3.4. A strategy is a function that picks a node from the fringe of a
search tree. (equivalently, orders the fringe and picks the first.)
Definition 8.3.5 (Important Properties of Strategies).
Note that there can be infinite branches, see the search tree for Romania.
The opposite of uninformed search is informed or heuristic search, which uses a heuristic function that adds external guidance to the search process. In the Romania example, one could add the heuristic to prefer cities that lie in the general direction of the goal (here SE).
Even though heuristic search is usually much more efficient, uninformed search is important nonetheless, because many problems do not allow us to extract good heuristics.
Breadth-First Search
Idea: Expand the shallowest unexpanded node.
Definition 8.4.2. The breadth first search (BFS) strategy treats the fringe as a
FIFO queue, i.e. successors go in at the end of the fringe.
Example 8.4.3 (Synthetic).
(Figure: BFS on a synthetic binary tree with root A, children B and C, then D–G, and leaves H–O; frame by frame, the shallowest unexpanded node is expanded, level by level.)
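A minimal Python sketch (not from the notes) of BFS as tree search with a FIFO fringe; the toy successor function is made up.

  # Breadth first search as tree search with a FIFO fringe (illustrative sketch).

  from collections import deque

  def bfs(initial, successors, is_goal):
      """Return a list of states from initial to a goal, or None."""
      fringe = deque([[initial]])            # FIFO queue of paths
      while fringe:
          path = fringe.popleft()            # shallowest unexpanded node first
          node = path[-1]
          if is_goal(node):
              return path
          for succ in successors(node):      # expand: successors go in at the end
              fringe.append(path + [succ])
      return None

  # Tiny synthetic example: a binary tree A -> B,C -> D,E,F,G.
  tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
  print(bfs("A", lambda s: tree.get(s, []), lambda s: s == "G"))  # ['A', 'C', 'G']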
We will now apply the breadth first search strategy to our running example: Traveling in Romania.
Note that we leave out the green dashed nodes that allow us a preview of what the search tree will look like (if expanded). This gives a much cleaner picture; we assume that the readers have already grasped the mechanism sufficiently.
(Figure: successive BFS expansions of the Romania search tree rooted at Arad.)
Optimal?: No! If cost varies for different steps, there might be better solutions
below the level of the first one.
An alternative is to generate all solutions and then pick an optimal one. This works only if m is finite.
The next idea is to let cost drive the search. For this, we will need a non-trivial cost function: we
will take the distance between cities, since this is very natural. Alternatives would be the driving
time, train ticket cost, or the number of tourist attractions along the way.
Of course we need to update our problem formulation with the necessary information.
(Figure: the Romania road map from AIMA with route distances in km used as step costs.)
Uniform-cost search
Idea: Expand the least-cost unexpanded node.
Definition 8.4.4. Uniform-cost search (UCS) is the strategy where the fringe is ordered by increasing path cost.
Note: Equivalent to breadth first search if all step costs are equal.
Synthetic Example:
(Figure: successive UCS expansions of the Romania search tree from Arad; the fringe is ordered by path cost, e.g. Zerind 75, Timisoara 118, Sibiu 140 on the first level.)
Note that we must sum the distances to each leaf. That is, we go back to the first level after the
third step.
If step cost is negative, the same situation as in breadth first search can occur: later solutions
may be cheaper than the current one.
If step cost is 0, one can run into infinite branches. UCS then degenerates into depth first
search, the next kind of search algorithm we will encounter. Even if we have infinite branches,
where the sum of step costs converges, we can get into trouble, since the search is forced down
these infinite paths before a solution can be found.
Worst case is often worse than BFS, because large trees with small steps tend to be searched
first. If step costs are uniform, it degenerates to BFS.
Depth-first Search
Idea: Expand deepest unexpanded node.
Definition 8.4.5. Depth-first search (DFS) is the strategy where the fringe is organized as a (LIFO) stack, i.e. successors go in at the front of the fringe.
Note: Depth first search can perform infinite cyclic excursions
Need a finite, non cyclic state space (or repeated state checking)
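A minimal Python sketch (not from the notes) of DFS with a LIFO fringe and repeated state checking along the current path; the toy graph (a fragment of the Romania map) is only for illustration.

  # Depth first search as tree search with a LIFO fringe, plus simple repeated
  # state checking along the current path to avoid infinite cyclic excursions.

  def dfs(initial, successors, is_goal):
      fringe = [[initial]]                     # LIFO stack of paths
      while fringe:
          path = fringe.pop()                  # deepest unexpanded node first
          node = path[-1]
          if is_goal(node):
              return path
          for succ in successors(node):
              if succ not in path:             # cycle check on the current path
                  fringe.append(path + [succ])
      return None

  graph = {"Arad": ["Sibiu", "Timisoara", "Zerind"],
           "Sibiu": ["Arad", "Fagaras"], "Fagaras": ["Sibiu", "Bucharest"]}
  print(dfs("Arad", lambda s: graph.get(s, []), lambda s: s == "Bucharest"))
  # -> ['Arad', 'Sibiu', 'Fagaras', 'Bucharest']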
Depth-First Search
Example 8.4.6 (Synthetic).
(Figure: DFS on the synthetic binary tree with nodes A–O; frame by frame, the search runs down the leftmost branch first and backtracks on failure.)
(Figures: DFS expansions of the Romania search tree starting at Arad, and iterative deepening search shown as a sequence of depth-limited searches over the synthetic tree with nodes A, B, C, D–G.)
Completeness: Yes
Time complexity: (d + 1) · b^0 + d · b^1 + (d − 1) · b^2 + . . . + b^d ∈ O(b^(d+1))
Space complexity: O(b · d)
Optimality: Yes (if step cost = 1)
Consequence: IDS used in practice for search spaces of large, infinite, or unknown
depth.
Note:
To find a solution (at depth d) we have to search the whole tree up to d. Of course since we
do not save the search state, we have to re-compute the upper part of the tree for the next level.
This seems like a great waste of resources at first; however, IDS tries to be complete without the space penalties.
However, the space complexity is as good as DFS, since we are using DFS along the way. Like
in BFS, the whole tree on level d (of optimal solution) is explored, so optimality is inherited from
there. Like BFS, one can modify this to incorporate uniform cost search behavior.
As a consequence, variants of IDS are the method of choice if we do not have additional
information.
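A minimal Python sketch (not from the notes) of iterative deepening search; the toy tree is made up.

  # Iterative deepening search: repeated depth-limited DFS with growing limit.
  # The upper parts of the tree are re-expanded, but space stays O(b·d) as in DFS.

  def depth_limited(node, successors, is_goal, limit, path):
      if is_goal(node):
          return path
      if limit == 0:
          return None
      for succ in successors(node):
          result = depth_limited(succ, successors, is_goal, limit - 1, path + [succ])
          if result is not None:
              return result
      return None

  def ids(initial, successors, is_goal, max_depth=50):
      for limit in range(max_depth + 1):           # limit 0, 1, 2, ...
          result = depth_limited(initial, successors, is_goal, limit, [initial])
          if result is not None:
              return result
      return None

  tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
  print(ids("A", lambda s: tree.get(s, []), lambda s: s == "E"))  # ['A', 'B', 'E']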
Graph versions of all the tree search algorithms considered here exist, but are more
difficult to understand (and to prove properties about).
The (time complexity) properties are largely stable under duplicate pruning. (no
gain in the worst case)
All of these search algorithms explore the search tree induced by a search problem in search of a goal state. Search strategies only differ by the treatment of the fringe.
Search Strategies and their Properties: We have discussed breadth first, uniform cost, depth first, and iterative deepening search, and compared their completeness, time and space complexity, and optimality.
Best-first search
Idea: Order the fringe by estimated “desirability” (Expand most desirable
unexpanded node)
Definition 8.5.1. An evaluation function assigns a desirability value to each node
of the search tree.
Note: An evaluation function is not part of the search problem, but must be added externally.
Definition 8.5.2. In best first search, the fringe is a queue sorted in decreasing order of desirability.
Special cases: Greedy search, A∗ search
This is like UCS, but with an evaluation function related to the problem at hand replacing the path cost function.
If the heuristic is arbitrary, we expect incompleteness!
Depends on how we measure “desirability”.
Concrete examples follow.
Greedy search
Idea: Expand the node that appears to be closest to the goal.
In greedy search we replace the objective cost of constructing the current solution with a heuristic or subjective measure that we think gives a good idea of how far we are from a solution. Two
things have shifted:
• we went from internal (determined only by features inherent in the search space) to an exter-
nal/heuristic cost
• instead of measuring the cost to build the current partial solution, we estimate how far we are
from the desired goal
(Figure: the Romania road map from AIMA with route distances in km.)
(Figure: greedy best-first expansions for Arad → Bucharest using the straight-line distance to Bucharest as heuristic: Arad 366; then Sibiu 253, Timisoara 329, Zerind 374; from Sibiu: Arad, Fagaras, Oradea, Rimnicu Vilcea; from Fagaras: Sibiu 253, Bucharest 0.)
Heuristic Functions in Path Planning
Example 8.5.7 (The maze solved). We indicate h∗ by giving the goal distance of each cell.
(Figure: a 5×15 maze grid annotated with the goal distance h∗ of each cell; the goal G is in the bottom right corner.)
Example 8.5.8 (Maze Heuristic: the good case). We use the Manhattan distance to the goal as a heuristic.
(Figure: the same grid annotated with the Manhattan distance to G, ranging from 18 in the top left corner to 0 at G.)
Example 8.5.10. Greedy search can get stuck going from Iasi to Oradea: Iasi → Neamt → Iasi → Neamt → · · ·
(Figure: the Romania road map with route distances; from Iasi, Neamt is closer to Oradea in straight-line distance than Vaslui, but it is a dead end, so greedy search oscillates between Iasi and Neamt.)
Worst-case Time:
Same as depth first search.
Worst-case Space:
Same as breadth first search.
But: A good heuristic can give dramatic improvements.
Remark 8.5.11.
Greedy Search is similar to UCS. Unlike the latter, the node evaluation function has nothing to
do with the nodes explored so far. This can prevent nodes from being enumerated systematically
as they are in UCS and BFS.
For completeness, we need repeated state checking as the example shows. This enforces com-
plete enumeration of state space (provided that it is finite), and thus gives us completeness.
Note that nothing prevents all nodes from being searched in the worst case; e.g. if the heuristic function gives us the same (low) estimate on all nodes except where the heuristic mis-estimates the distance to be high. So in the worst case, greedy search is even worse than BFS, where d (depth of first solution) replaces m.
The search procedure cannot be optimal, since the actual cost of the solution is not considered.
For both, completeness and optimality, therefore, it is necessary to take the actual cost of
partial solutions, i.e. the path cost, into account. This way, paths that are known to be expensive
are avoided.
Heuristic Functions
Definition 8.5.12. Let Π be a problem with states S. A heuristic function (or short heuristic) for Π is a function h : S → R⁺₀ ∪ {∞} so that h(s) = 0 whenever s is a goal state.
h(s) is intended as an estimate of the distance between state s and the nearest goal state.
Definition 8.5.13. Let Π be a problem with states S, then the function h∗ : S → R⁺₀ ∪ {∞}, where h∗(s) is the cost of a cheapest path from s to a goal state, or ∞ if no such path exists, is called the goal distance function for Π.
Notes:
h(s) = 0 on goal states: If your estimator returns “I think it’s still a long way”
on a goal state, then its “intelligence” is, um . . .
Return value ∞: To indicate dead ends, from which the goal can’t be reached
anymore.
The distance estimate depends only on the state s, not on the node (i.e., the
path we took to reach s).
Note that the same word is often used for “rule of thumb” or “imprecise solution method”.
Proof: We show h(s)≤h∗(s) by induction over the length n of a cheapest path from s to a goal state.
1. base case
1.1. If s is a goal state, then h(s) = 0 by definition of heuristic, so h(s)≤h∗(s) as desired.
2. step case
2.1. We assume that h(s′)≤h∗(s′) for all states s′ with a cheapest goal path of length n.
2.2. Let s be a state whose cheapest goal path has length n+1 and the first transition is o = (s,s′).
2.3. By consistency, we have h(s) − h(s′)≤c(o) and thus h(s)≤h(s′) + c(o).
2.4. By construction, s′ has a cheapest goal path of length n and thus, by induction hypothesis, h(s′)≤h∗(s′).
2.5. By construction, h∗(s) = h∗(s′) + c(o).
2.6. Together this gives us h(s)≤h∗(s) as desired.
Thus f (n) is the estimated total cost of the path through n to a goal.
Definition 8.5.20. Best first search with evaluation function g + h is called A∗
search.
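A minimal Python sketch (not from the notes) of A∗ on a fragment of the Romania example; the listed road distances and the usual straight-line distances to Bucharest cover only a few cities.

  # A* search: best first search with evaluation function f(n) = g(n) + h(n).

  import heapq

  graph = {"Arad": {"Sibiu": 140, "Timisoara": 118, "Zerind": 75},
           "Sibiu": {"Fagaras": 99, "Rimnicu Vilcea": 80},
           "Fagaras": {"Bucharest": 211},
           "Rimnicu Vilcea": {"Pitesti": 97},
           "Pitesti": {"Bucharest": 101}}
  h = {"Arad": 366, "Sibiu": 253, "Timisoara": 329, "Zerind": 374, "Fagaras": 176,
       "Rimnicu Vilcea": 193, "Pitesti": 100, "Bucharest": 0}

  def a_star(start, goal):
      fringe = [(h[start], 0, [start])]            # entries are (f, g, path)
      while fringe:
          f, g, path = heapq.heappop(fringe)       # node with minimal f = g + h
          node = path[-1]
          if node == goal:
              return path, g
          for succ, cost in graph.get(node, {}).items():
              g2 = g + cost
              heapq.heappush(fringe, (g2 + h[succ], g2, path + [succ]))
      return None

  print(a_star("Arad", "Bucharest"))
  # (['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'], 418)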
This works, provided that h does not overestimate the true cost to achieve the goal. In other
words, h must be optimistic wrt. the real cost h∗ . If we are too pessimistic, then non-optimal
solutions have a chance.
A∗ Search: Optimality
Theorem 8.5.21. A∗ search with admissible heuristic is optimal.
Proof: We show that sub-optimal nodes are never selected by A∗
1. Suppose a suboptimal goal G has been generated; then we are in the following situation:
(Figure: the search tree rooted at the start node, with an unexpanded node n on an optimal path to the optimal goal O, and the suboptimal goal G already in the fringe.)
A∗ Search Example
(Figure: successive A∗ expansions for the Romania example with f = g + h annotations, starting with Arad 366=0+366 and later values such as 646=280+366 and 671=291+380.)
Additional Observations (Not Limited to Path Planning)
Example 8.5.22 (Greedy best-first search, “good case”).
(Figure: the maze from Example 8.5.8, annotated once with the Manhattan distance heuristic h and once with the g + h values encountered by A∗.)
In A∗ with a consistent heuristic, g + h always increases monotonically (h cannot decrease more than g increases).
We need more search, in the “right upper half”. This is typical: Greedy best-first search tends to be faster than A∗.
Additional Observations (Not Limited to Path Planning)
Example 8.5.24 (Greedy best-first search, “bad case”). We use the Manhattan distance to the goal as a heuristic again.
(Figure: a maze with a long dead-end street, annotated once with the Manhattan distance heuristic h and once with the values encountered by the search.)
We will search less of the “dead-end street”. Sometimes g + h gives better search guidance than h. (; A∗ is faster there)
Additional Observations (Not Limited to Path Planning)
(Figure: the same maze annotated with the node values encountered by A∗.)
In A∗, node values always increase monotonically (with any heuristic). If the heuristic is perfect, they remain constant on optimal paths.
A∗ search: f-contours
A∗ gradually adds “f-contours” of nodes to the explored region.
(Figure: the Romania map with concentric f-contours at 380, 400, and 420 around Arad.)
With more accurate heuristics, the bands stretch toward the goal and become more narrowly focused around the optimal path. If C∗ is the cost of the optimal solution path, then
A∗ expands all nodes with f(n) < C∗,
and it might then expand some of the nodes right on the “goal contour” (where f(n) = C∗) before selecting a goal node, but it expands no nodes with f(n) > C∗.
Completeness requires that there be only finitely many nodes with cost less than or equal to C∗, a condition that is true if all step costs exceed some finite ε and if b is finite.
8.5.4 Finding Good Heuristics
A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/22021.
Since the availability of admissible heuristics is so important for informed search (particularly A∗), let us see how such heuristics can be obtained in practice. We will look at an example, and then derive a general procedure from that.
Admissible heuristics: Example 8-puzzle
(Figure: an 8-puzzle instance with a start state on the left and the goal state on the right.)
Example 8.5.28. Let h2(n) be the total Manhattan distance of each tile from its desired location. (h2(S) = 2 + 0 + 3 + 1 + 0 + 1 + 3 + 4 = 14)
Observation 8.5.29 (Typical search costs). (IDS ≙ iterative deepening search)
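As an illustration (not part of the notes), here is a small Python sketch of the two standard 8-puzzle heuristics, h1 (number of misplaced tiles) and h2 (total Manhattan distance); the board encoding, goal-state convention, and example state are assumptions.

  # h1 and h2 for 3x3 boards given as tuples of rows, with 0 for the blank.

  GOAL = ((1, 2, 3), (4, 5, 6), (7, 8, 0))   # assumed goal convention

  def positions(board):
      return {board[r][c]: (r, c) for r in range(3) for c in range(3)}

  def h1(board):
      """Number of tiles not in their goal position (the blank does not count)."""
      goal_pos = positions(GOAL)
      return sum(1 for tile, pos in positions(board).items()
                 if tile != 0 and pos != goal_pos[tile])

  def h2(board):
      """Sum of Manhattan distances of each tile from its goal position."""
      goal_pos = positions(GOAL)
      return sum(abs(r - goal_pos[t][0]) + abs(c - goal_pos[t][1])
                 for t, (r, c) in positions(board).items() if t != 0)

  state = ((1, 2, 3), (4, 0, 6), (8, 7, 5))
  print(h1(state), h2(state))   # h2 dominates h1: h2(n) >= h1(n) for all n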
Dominance
Definition 8.5.30. Let h1 and h2 be two admissible heuristics. We say that h2 dominates h1, iff h2(n) ≥ h1(n) for all n.
Theorem 8.5.31. If h2 dominates h1 , then h2 is better for search than h1 .
Relaxed problems
Observation: Finding good admissible heuristics is an art!
Idea: Admissible heuristics can be derived from the exact solution cost of a relaxed
version of the problem.
Example 8.5.32. If the rules of the 8-puzzle are relaxed so that a tile can move
anywhere, then we get heuristic h1 .
Example 8.5.33. If the rules are relaxed so that a tile can move to any adjacent
square, then we get heuristic h2 . (Manhattan distance)
Definition 8.5.34. Let Π := ⟨S, A, T, I, G⟩ be a search problem, then we call a search problem Pr := ⟨S, A^r, T^r, I^r, G^r⟩ a relaxed problem (wrt. Π; or simply relaxation of Π), iff A ⊆ A^r, T ⊆ T^r, I ⊆ I^r, and G ⊆ G^r.
Lemma 8.5.35. If Pr relaxes Π, then every solution for Π is one for Pr.
Key point: The optimal solution cost of a relaxed problem is not greater than the
optimal solution cost of the real problem.
Relaxation means to remove some of the constraints or requirements of the original problem,
so that a solution becomes easy to find. Then the cost of this easy solution can be used as an
optimistic approximation of the problem.
See http://qiao.github.io/PathFinding.js/visual/
Difference to Breadth-first Search?: That would explore all grid cells in a circle
around the initial state!
Example 8.6.2.
All tree search algorithms (except pure depth first search) are systematic. (given
reasonable assumptions e.g. about costs.)
Observation 8.6.3. Systematic search algorithms are complete.
Observation 8.6.4. In systematic search algorithms there is no limit on the number of nodes that are kept in memory at any time.
Alternative: Keep only one (or a few) nodes at a time ;
no systematic exploration of all options, ; incomplete.
Local Search: Iterative improvement algorithms
Definition 8.6.7 (Traveling Salesman Problem). Find the shortest trip through a set of cities such that each city is visited exactly once.
Idea: Start with any complete tour, perform pairwise exchanges.
procedure Hill−Climbing (problem) /∗ returns a state that is a local maximum ∗/
  current := Make−Node(Initial−State[problem])
  loop
    neighbor := <a highest−valued successor of current>
    if Value[neighbor] < Value[current] return [current] end if
    current := neighbor
  end loop
end procedure
Intuition:
Like best first search without memory.
In order to understand the procedure on a more intuitive level, let us consider the following
scenario: We are in a dark landscape (or we are blind), and we want to find the highest hill. The
search procedure above tells us to start our search anywhere, and for every step first feel around,
and then take a step into the direction with the steepest ascent. If we reach a place, where the
next step would take us down, we are finished.
Of course, this will only get us into local maxima, and has no guarantee of getting us into
global ones (remember, we are blind). The solution to this problem is to re-start the search at
random (we do not have any information) places, and hope that one of the random jumps will get
us to a slope that leads to a global maximum.
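The following Python sketch (not from the notes) implements this intuition: steepest-ascent hill climbing plus random restarts; the 1D objective landscape is made up.

  # Hill climbing (steepest ascent) with random restarts (illustrative sketch).

  import random

  def hill_climb(start, neighbors, value):
      current = start
      while True:
          best = max(neighbors(current), key=value, default=None)
          if best is None or value(best) <= value(current):
              return current                     # local maximum reached
          current = best

  def random_restart_hill_climb(candidates, neighbors, value, restarts=10):
      starts = [random.choice(candidates) for _ in range(restarts)]
      return max((hill_climb(s, neighbors, value) for s in starts), key=value)

  # A bumpy 1D landscape over the integers 0..99 with many local maxima.
  value = lambda x: -(x - 42) ** 2 + 15 * (x % 7)
  neighbors = lambda x: [n for n in (x - 1, x + 1) if 0 <= n <= 99]
  print(random_restart_hill_climb(range(100), neighbors, value))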
Example 8.6.10. An 8-queens state with heuristic cost estimate h = 17. (the board figure is omitted)
Recent work on hill climbing algorithms tries to combine complete search with randomization to escape certain odd phenomena occurring in the statistical distribution of solutions.
Simulated annealing (Idea)
Annealing is the process of heating steel and letting it cool gradually to give it time to grow an optimal crystal structure.
Simulated annealing is like shaking a ping pong ball occasionally on a bumpy surface to free it. (so it does not get stuck)
Devised by Metropolis et al. for physical process modelling [Met+53]
Widely used in VLSI layout, airline scheduling, etc.
Simulated annealing (Implementation)
Definition 8.6.13. The following algorithm is called simulated annealing:
procedure Simulated−Annealing (problem,schedule) /∗ a solution state ∗/
  local node, next /∗ nodes ∗/
  local T /∗ a ‘‘temperature’’ controlling the probability of downward steps ∗/
  current := Make−Node(Initial−State[problem])
  for t := 1 to ∞
    T := schedule[t]
    if T = 0 return current end if
    next := <a randomly selected successor of current>
    ∆(E) := Value[next] − Value[current]
    if ∆(E) > 0 current := next
    else current := next /∗ only with probability e^(∆(E)/T) ∗/
    end if
  end for
end procedure
e^(E(x∗)/kT) / e^(E(x)/kT) = e^((E(x∗)−E(x))/kT) ≫ 1 for small T.
Question: Is this necessarily an interesting guarantee?
(Figure 4.6 from AIMA: The genetic algorithm, illustrated for digit strings representing 8-queens states. The initial population in (a) is ranked by the fitness function in (b), resulting in pairs for mating in (c). They produce offspring in (d), which are subject to mutation in (e).)
Genetic algorithms (continued)
(Figure 4.7 from AIMA: The 8-queens states corresponding to the first two parents in Figure 4.6(c) and the first offspring in Figure 4.6(d). The shaded columns are lost in the crossover step and the unshaded columns are retained.)
Note: Genetic algorithms ≠ evolution: e.g., real genes also encode the replication machinery!
9.1 Introduction
Video Nuggets covering this section can be found at https://fau.tv/clip/id/22060 and
https://fau.tv/clip/id/22061.
An Example Game
Answer: Declarative! With “game description language” ≙ natural language.
Given limited time, the best we can do is look ahead as far as we can. Evaluation
functions tell us how to evaluate the leaf states at the cut off.
Alpha-Beta Search: How to prune unnecessary parts of the tree?
Often, we can detect early on that a particular action choice cannot be part of
the optimal strategy. We can then stop considering this part of the game tree.
State of the art: What is the state of affairs, for prominent games, of computer
game playing vs. human experts?
Just FYI (not part of the technical content of this course).
“Minimax”?
We want to compute an optimal strategy for player “Max”.
In other words: We are Max, and our opponent is Min.
Recall:
We compute the strategy offline, before the game begins. During the game,
whenever it’s our turn, we just lookup the corresponding action.
Max attempts to maximize the utility u(s) of the terminal state that will be
reached during play.
Min attempts to minimize u(s).
So what?
The computation alternates between minimization and maximization ; hence “minimax”.
Example Tic-Tac-Toe
(Figure 5.1 from AIMA: A (partial) game tree for the game of tic-tac-toe. The top node is the initial state, and MAX moves first, placing an X in an empty square. We show part of the tree, giving alternating moves by MIN (O) and MAX (X), until we eventually reach terminal states, which can be assigned utilities according to the rules of the game.)
Game tree, current player marked on the left.
Last row: terminal positions with their utility.
Minimax: Outline
In a normal search problem, the optimal solution would be a sequence of actions leading to a goal state—a terminal state that is a win. In adversarial search, MIN has something to say about it. MAX therefore must find a contingent strategy, which specifies MAX’s move in the initial state, then MAX’s moves in the states resulting from every possible response by MIN, and so on.
Minimax: Example
(Figure: a two-ply game tree with leaf utilities 3 12 8 | 2 4 6 | 14 5 2; the three Min nodes have values 3, 2, and 2, so the Max root has value 3.)
Note: The maximal possible pay-off is higher for the rightmost branch, but assuming
perfect play of Min, it’s better to go left. (Going right would be “relying on your
opponent to do something stupid”.)
There’s no need to re-run minimax for every game state: Run it once, offline
before the game starts. During the actual game, just follow the branches taken
in the tree. Whenever it’s your turn, choose an action maximizing the value of
the successor states.
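A minimal Python sketch (not from the notes) of minimax over an explicit game tree; the tree encoding is an assumption, using the two-ply example above.

  # Minimax over an explicit game tree (leaf utilities 3 12 8 | 2 4 6 | 14 5 2).

  def minimax(node, maximizing):
      if isinstance(node, (int, float)):        # terminal state: return its utility
          return node
      values = [minimax(child, not maximizing) for child in node]
      return max(values) if maximizing else min(values)

  tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]    # Max node over three Min nodes
  print(minimax(tree, True))                    # -> 3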
Analogy to heuristic functions (cf. section 8.5): We want f to be both (a) accurate
and (b) fast.
Another analogy: (a) and (b) are in contradiction ; need to trade-off accuracy
against overhead.
Example Chess
This assumes that the features (their contribution towards the actual value of the state) are
independent. That’s usually not the case (e.g. the value of a Rook depends on the Pawn struc-
ture).
(Figure: minimax values on the example tree; after the first Min subtree evaluates to 3, the Max root is known to be ≥ 3.)
Alpha Pruning
What is α? For each search node n, the highest Max-node utility that search has
encountered on its path from the root to n.
(Figure: alpha values during the search: the Max root starts with α = −∞, is updated to α = 3 after the first Min node, and the second Min node can be abandoned as soon as its first leaf 2 ≤ α is seen.)
How to use α?: In a Min node n, if one of the successors already has utility ≤ α,
then stop considering n. (Pruning out its remaining successors.)
Alpha-Beta Pruning
Recall:
What is α: For each search node n, the highest Max-node utility that search
has encountered on its path from the root to n.
How to use α: In a Min node n, if one of the successors already has utility
≤ α, then stop considering n. (Pruning out its remaining successors.)
Idea: We can use a dual method for Min:
What is β: For each search node n, the lowest Min-node utility that search has
encountered on its path from the root to n.
How to use β: In a Max node n, if one of the successors already has utility
≥ β, then stop considering n. (Pruning out its remaining successors.)
. . . and of course we can use both together!
Note: α only gets assigned a value in Max nodes, and β only gets assigned a value in Min nodes.
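A minimal Python sketch (not from the notes) of alpha-beta pruning on the same explicit game tree; the tree encoding is an assumption.

  # Alpha-beta pruning: α is the best Max value on the path so far, β the best
  # Min value; a Min node stops as soon as one successor value is <= α (dually for Max).

  def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
      if isinstance(node, (int, float)):
          return node
      if maximizing:
          value = float("-inf")
          for child in node:
              value = max(value, alphabeta(child, False, alpha, beta))
              alpha = max(alpha, value)
              if value >= beta:                 # prune remaining successors
                  break
          return value
      else:
          value = float("inf")
          for child in node:
              value = min(value, alphabeta(child, True, alpha, beta))
              beta = min(beta, value)
              if value <= alpha:                # prune remaining successors
                  break
          return value

  tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
  print(alphabeta(tree, True))                  # -> 3 (the leaves 4 and 6 are pruned)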
(Figure: alpha-beta search on the example tree with leaves 3 12 8 | 2 | 14 5 2; the [α, β] windows are updated during the search, e.g. the root goes from [−∞, ∞] to [3, ∞], the first Min node ends at value 3, the second is cut off after its first leaf 2, and the third ends at value 2 after seeing 14, 5, 2.)
Note: We could have saved work by choosing the opposite order for the successors
of the rightmost Min node. Choosing the best moves (for each of Max and Min)
first yields more pruning!
(Figure: the same example with the successors of the rightmost Min node explored in the opposite order, which yields more pruning.)
And now . . .
Definition 9.5.1. For Monte Carlo sampling we evaluate actions through sampling.
When deciding which action to take on game state s:
while time not up do
select action a applicable to s
run a random sample from a until terminal state t
return an a for s with maximal average u(t)
Definition 9.5.2. For the Monte Carlo tree search algorithm (MCTS) we maintain
a search tree T , the MCTS tree.
while time not up do
apply actions within T to select a leaf state s′
select action a′ applicable to s′ , run random sample from a′
add s′ to T , update averages etc.
return an a for s with maximal average u(t)
When executing a, keep the part of T below a.
This looks only at a fraction of the search tree, so it is crucial to have good guidance where to go,
i.e. which part of the search tree to look at.
(Figure: six Monte Carlo samples from the root with three applicable actions; the per-action expansion counts grow from 0, 0, 0 to 2, 2, 2 and the average rewards from 0, 0, 0 to 60, 55, 35; the sampled terminal rewards are 10, 70, 40, 30, 50, and 100.)
The sampling goes middle, left, right, right, left, middle. Then it stops and selects the highest-average action: left, with average 60. After the first sample, when the values in the initial state are being updated, we have the following “expansions” and “avg. reward” fields: a small number of expansions is favored for exploration (visit parts of the tree rarely visited before: what is out there?); a high avg. reward is favored for exploitation (focus on promising parts of the search tree).
(Figure: the same six samples, now incrementally building the MCTS tree; each node added to the tree keeps its own expansion counts and average rewards, which are updated along the entire path of every sample.)
This is the exact same search as on the previous slide, but now we incrementally build the search tree by always keeping the first state of the sample. The first three iterations (middle, left, right) show the tree extension; note that, like the root node, the nodes added to the tree have expansion and avg. reward counters for every applicable action. In the next iteration (right), after the leaf node with value 30 was found, the important point is that the averages get updated along the entire path, i.e., not only in the root as we did before, but also in the nodes along the way. After all six iterations have been done, as before we select the action left (value 60); but now we keep the part of the tree below that action, saving relevant work already done before.
AlphaGo: Overview
Definition 9.5.5 (Neural Networks in AlphaGo).
Policy networks: Given a state s, output a probability distribution over the
actions applicable in s.
Value networks: Given a state s, output a number estimating the game value
of s.
[Figure 1 from [Sil+16]: neural network training pipeline and architecture — a, the training pipeline from human expert positions (rollout policy pπ, SL policy network pσ) via self-play positions (RL policy network pρ) to the value network vθ; b, the policy network and value network architectures.]
Illustration taken from [Sil+16]
Rollout policy pπ: Simple but fast, ≈ prior work on Go.
SL policy network pσ: Supervised learning, human-expert data (“learn to choose an expert action”).
RL policy network pρ: Reinforcement learning, self-play (“learn to win”).
Value network vθ: Use self-play games with pρ as training data for game-position evaluation vθ (“predict which player will win in this state”).
Comments on the Figure:
a A fast rollout policy pπ and a supervised learning (SL) policy network pσ are trained to predict human expert moves in a data set of positions. A reinforcement learning (RL) policy network pρ is initialized to the SL policy network, and is then improved by policy gradient learning to maximize the outcome (that is, winning more games) against previous versions of the policy network. A new data set is generated by playing games of self-play with the RL policy network. Finally, a value network vθ is trained by regression to predict the expected outcome (that is, whether the current player wins) in positions from the self-play data set.
b Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position s as its input, passes it through many convolutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a probability distribution pσ(a|s) or pρ(a|s) over legal moves a, represented by a probability map over the board. The value network similarly uses many convolutional layers with parameters θ, but outputs a scalar value vθ(s′) that predicts the expected outcome in position s′.
[Figure 3 from [Sil+16]: Monte Carlo tree search in AlphaGo (panels a–d).]
Illustration taken from [Sil+16]
Rollout policy pπ: Action choice in random samples.
SL policy network pσ: Action bias within the UCTS tree (stored as “P”, which shrinks to “u(P)” with the number of visits); used when descending the tree along with the quality Q.
RL policy network pρ: Not used here (used only to learn vθ).
Value network vθ: Used to evaluate leaf states s, in a linear sum with the value returned by a random sample on s.
Comments on the Figure:
a Each simulation traverses the tree by selecting the edge with maximum action value Q, plus a bonus u(P) that depends on a stored prior probability P for that edge.
b The leaf node may be expanded; the new node is processed once by the policy network pσ and the output probabilities are stored as prior probabilities P for each action.
c At the end of a simulation, the leaf node is evaluated in two ways:
• using the value network vθ,
• and by running a rollout to the end of the game with the fast rollout policy pπ, then computing the winner with function r.
d Action values Q are updated to track the mean value of all evaluations r(·) and vθ(·) in the subtree below that action.
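For reference, the quantities used in comments a–d are combined in [Sil+16] as follows (formulas reproduced from the paper):

a_t = argmax_a ( Q(s_t, a) + u(s_t, a) ),   where  u(s, a) ∝ P(s, a) / (1 + N(s, a))
V(s_L) = (1 − λ) · vθ(s_L) + λ · z_L        (leaf evaluation with mixing parameter λ)
N(s, a) = Σ_{i=1..n} 1(s, a, i),   Q(s, a) = (1 / N(s, a)) · Σ_{i=1..n} 1(s, a, i) · V(s_L^i)

where 1(s, a, i) indicates whether the edge (s, a) was traversed during the i-th simulation, and s_L^i is the leaf node of that simulation.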
AlphaGo, Conclusion?: This is definitely a great achievement!
• “Search + neural networks” looks like a great formula for general problem solving.
• expect to see lots of research on this in the coming decade(s).
• The AlphaGo design is quite intricate (architecture, learning workflow, training data design,
neural network architectures, . . . ).
The chess machine is an ideal one to start with, since (Claude Shannon (1949))
1. the problem is sharply defined both in allowed operations (the moves) and in the
ultimate goal (checkmate),
2. it is neither so simple as to be trivial nor too difficult for satisfactory solution,
3. chess is generally considered to require “thinking” for skilful play, [. . . ]
4. the discrete structure of chess fits well into the digital nature of modern comput-
ers.
Chess is the drosophila of Artificial Intelligence. (Alexander Kronrod (1965))
9.7 Conclusion
Summary
Games (2-player turn-taking zero-sum discrete and finite games) can be understood
as a simple extension of classical search problems.
Each player tries to reach a terminal state with the best possible utility (maximal
vs. minimal).
Minimax searches the game tree depth-first, max’ing and min’ing at the respective turns of each player. It yields perfect play, but takes time O(b^d), where b is the branching factor and d the search depth.
Except in trivial games (Tic-Tac-Toe), Minimax needs a depth limit and must apply an evaluation function to estimate the value of the cut-off states.
Alpha-beta search remembers the best values already achieved for each player elsewhere in the tree, and prunes sub-trees that cannot influence the result under optimal play.
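As a concrete illustration of these two points, here is a compact sketch of depth-limited minimax with alpha-beta pruning (not code from the lecture; the game interface actions, result, is_cutoff, eval_fn, and to_move is hypothetical):

import math

def alphabeta(state, depth, alpha=-math.inf, beta=math.inf):
    if is_cutoff(state, depth):
        return eval_fn(state)                 # evaluation function at cut-off states
    if to_move(state) == "Max":
        value = -math.inf
        for a in actions(state):
            value = max(value, alphabeta(result(state, a), depth - 1, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                 # Min already has a better alternative
                break                         # elsewhere: prune the remaining siblings
        return value
    value = math.inf
    for a in actions(state):
        value = min(value, alphabeta(result(state, a), depth - 1, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value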
Monte Carlo tree search (MCTS) samples game branches, and averages the findings.
AlphaGo controls this using neural networks: evaluation function (“value network”),
and action filter (“policy network”).
Suggested Reading:
In the last chapters we have studied methods for “general problem solving”, i.e. methods that are applicable to all problems that are expressible in terms of states and “actions”. It is crucial to realize that these states were atomic, which makes the algorithms employed (search algorithms) relatively simple and generic, but does not let them exploit any knowledge we might have about the internal structure of states.
In this chapter, we will look into algorithms that do just that by progressing to factored state representations. We will see that this allows for algorithms that are many orders of magnitude more efficient than search algorithms.
To give an intuition for factored state representations, we present some motivational examples in section 10.1 and go into the details of the Waltz algorithm, which gave rise to the main ideas of constraint satisfaction algorithms, in section 10.2. section 10.3 and section 10.4 define constraint satisfaction problems formally and use that to develop a class of backtracking/search based algorithms. The main contribution of the factored state representations is that we can formulate advanced search heuristics that guide search based on the structure of the states.
Definition 10.1.2.
A constraint satisfaction problem (CSP) is a search problem, where the states are given by a finite set V := {X1, . . ., Xn} of variables and domains {Dv | v∈V}, and the goal states are specified by a set of constraints specifying allowable combinations of values for subsets of variables.
Remark 10.1.5. We are now using a factored representation for world states.
Simple example of a formal representation language.
Allows useful general-purpose algorithms with more power than standard tree search algorithms.
[Figure: the principal states and territories of Australia and the corresponding map-coloring problem (cf. Figure 6.1 in [RN09]).]
Variables: WA, NT, Q, NSW, V, SA, T
Domains: Di = {red, green, blue}
Constraints: adjacent regions must have different colors, e.g. WA ≠ NT (if the language allows this), or
⟨WA, NT⟩ ∈ {⟨red, green⟩, ⟨red, blue⟩, ⟨green, red⟩, . . . }
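To make the example concrete, here is one possible plain-Python encoding of this CSP (an illustrative sketch, not a prescribed format):

VARIABLES = ["WA", "NT", "Q", "NSW", "V", "SA", "T"]
DOMAINS = {v: {"red", "green", "blue"} for v in VARIABLES}
# binary "different color" constraints between adjacent regions
NEQ = [("WA", "NT"), ("WA", "SA"), ("NT", "SA"), ("NT", "Q"), ("SA", "Q"),
       ("SA", "NSW"), ("SA", "V"), ("Q", "NSW"), ("NSW", "V")]

def consistent(assignment):
    """Check a (partial) assignment {variable: color} against all constraints."""
    return all(assignment[u] != assignment[v]
               for u, v in NEQ if u in assignment and v in assignment)

# e.g. consistent({"WA": "red", "NT": "green", "SA": "blue"})  ->  True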
If A = C: v_{A vs. B} + 1 ≠ v_{C vs. D}
(each team alternates between home matches and away matches).
Leading teams of last season meet
near the end of each half-season.
...
Estimated running time: End of this universe, and the next couple billion ones
after it . . .
Directly enumerate all permutations of the numbers 1, . . . , 306, test for each whether
it’s a legal Bundesliga schedule.
Estimated running time: Maybe only the time span of a few thousand uni-
verses.
View this as variables/constraints and use backtracking (this chapter)
Executed running time: About 1 minute.
How do they actually do it?: Modern computers and CSP methods: fractions
of a second. 19th (20th/21st?) century: Combinatorics and manual work.
Try it yourself with an off-the-shelf CSP solver, e.g. Minion [Min].
1. U.S. Major League Baseball, 30 teams, each 162 games. There’s one crucial additional difficulty,
in comparison to Bundesliga. Which one? Travel is a major issue here!! Hence “Traveling
Tournament Problem” in reference to the TSP.
2. This particular scheduling problem is called “car sequencing”, how to most efficiently get cars
through the available machines when making the final customer configuration (non-standard/flexible/custom
extras).
Simple methods for making backtracking aware of the structure of the problem, and thereby reducing search.
We will now have a detailed look at the problem (and innovative solution) that started the
field of constraint satisfaction problems.
Background:
Adolfo Guzman worked on an algorithm to count the number of simple objects (like children’s
blocks) in a line drawing. David Huffman formalized the problem and limited it to objects in
general position, such that the vertices are always adjacent to three faces and each vertex is
formed from three planes at right angles (trihedral). Furthermore, the drawings could only have
three kinds of lines: object boundary, concave, and convex. Huffman enumerated all possible
configurations of lines around a vertex. This problem was too narrow for real-world situations, so
Waltz generalized it to include cracks, shadows, non-trihedral vertices and light. This resulted in
over 50 different line labels and thousands of different junctions. [ILD]
Idea: Adjacent intersections impose constraints on each other. Use CSP to find a
unique set of labelings.
Observation 10.2.1. Each line in the images is one of the following:
a boundary line (edge of an object), labeled with an arrow, with the right hand side of the arrow denoting “solid”
and the left hand side denoting “space”
an interior convex edge (label with “+”)
an interior concave edge (label with “-”)
Fun Fact: CSP always works perfectly! (early success story for CSP [Wal75])
Waltz’s Examples
In his 1972 dissertation [Wal75], David Waltz used the following examples.
We will now work our way towards a definition of CSPs that is formal enough so that we can define the concept of a solution. This gives us the necessary grounding to talk about algorithms
later. Video Nuggets covering this section can be found at https://fau.tv/clip/id/22277
and https://fau.tv/clip/id/22279.
Types of CSPs
Definition 10.3.1. We call a CSP discrete, iff all of the variables have countable
domains; we have two kinds:
finite domains (size d; O(d^n) solutions)
e.g., Boolean CSPs (solvability ≙ Boolean satisfiability; NP-complete)
infinite domains (e.g. integers, strings, etc.)
e.g., job scheduling, variables are start/end days for each job
need a “constraint language”, e.g., StartJob1 + 5≤StartJob3
linear constraints decidable, nonlinear ones undecidable
Types of Constraints
We classify the constraints by the number of variables they involve.
Definition 10.3.6. Unary constraints involve a single variable, e.g., SA ̸= green.
Definition 10.3.7. Binary constraints involve pairs of variables, e.g., SA ̸= WA.
Definition 10.3.8. Higher-order constraints involve three or more variables, e.g., cryptarithmetic column constraints.
The number n of variables is called the order of the constraint.
Definition 10.3.9. Preferences (soft constraints), e.g. “red is better than green”, are often representable by a cost for each variable assignment; this leads to constrained optimization problems.
    S E N D
  + M O R E
  ---------
  M O N E Y

Column constraints (with carry variables X1, X2, X3):
D + E = Y + 10 · X1
X1 + N + R = E + 10 · X2
X2 + E + O = N + 10 · X3
X3 + S + M = O + 10 · M
Problem: The problem structure gets hidden. (search algorithms can get
confused)
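For illustration, the column constraints above can be checked mechanically; the following naive brute-force sketch (not one of the CSP algorithms of this chapter) enumerates injective digit assignments and the carries X1, X2, X3:

from itertools import permutations

def solve():
    for digits in permutations(range(10), 8):
        S, E, N, D, M, O, R, Y = digits
        if S == 0 or M == 0:                          # no leading zeros
            continue
        for X1 in (0, 1):
            for X2 in (0, 1):
                for X3 in (0, 1):
                    if (D + E == Y + 10 * X1 and
                            X1 + N + R == E + 10 * X2 and
                            X2 + E + O == N + 10 * X3 and
                            X3 + S + M == O + 10 * M):
                        return dict(zip("SENDMORY", digits))
    return None

print(solve())    # {'S': 9, 'E': 5, 'N': 6, 'D': 7, 'M': 1, 'O': 0, 'R': 8, 'Y': 2}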
Constraint Graph
Definition 10.3.13. A binary CSP is a CSP where each constraint is binary.
Observation 10.3.14. A binary CSP forms a graph called the constraint graph
whose nodes are variables, and whose edges represent the constraints.
Example 10.3.15. Australia as a binary CSP.
[Figure: the map of Australia and the corresponding constraint graph (cf. Figure 6.1 in [RN09]).]
Intuition: General-purpose CSP algorithms use the graph structure to speed up search. (E.g., Tasmania is an independent subproblem!)
Real-world CSPs
Example 10.3.16 (Assignment problems). e.g., who teaches what class
Example 10.3.17 (Timetabling problems). e.g., which class is offered when and
where?
Example 10.3.18 (Hardware configuration).
Example 10.3.19 (Spreadsheets).
Example 10.3.20 (Transportation scheduling).
Note that the ideas are still the same as Example 10.1.6, but in constraint networks
we have a language to formulate things precisely.
Idea: We will explore that idea for algorithms that solve constraint networks.
[Figure: two partial assignments, NT = green / Q = red and NT = green / Q = blue.]
Backtracking Search
Backtracking in Australia
Example 10.4.3. We apply backtracking search to the Australia map coloring problem. (The figures show steps 1–4 of the search.)
Example 10.4.5. In step 3 of Example 10.4.3, there is only one remaining value
for SA!
Example 10.4.7.
Where in Example 10.4.7 does the most constraining variable play a role in the choice? SA (the only possible choice), NT (all choices possible except WA, V, T). Where in the illustration does the most constrained variable play a role in the choice? NT (all choices possible except T), Q (only Q and WA possible).
By choosing the least constraining value first, we increase the chances of not ruling out the solutions below the current node.
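The following Python sketch (illustrative only) combines naive backtracking with the two ordering heuristics just discussed, for binary “not equal” CSPs such as map coloring; the input format (domains, neighbors) is an assumption made for this sketch:

def consistent(var, val, assignment, neighbors):
    """val for var does not clash with any already-assigned neighbor."""
    return all(assignment.get(u) != val for u in neighbors[var])

def backtrack(assignment, domains, neighbors):
    if len(assignment) == len(domains):
        return assignment                                   # all variables assigned
    unassigned = [v for v in domains if v not in assignment]
    # most constrained variable: fewest remaining consistent values
    var = min(unassigned, key=lambda v: sum(
        consistent(v, d, assignment, neighbors) for d in domains[v]))
    # least constraining value: eliminates the fewest options of the neighbors
    def ruled_out(val):
        return sum(1 for u in neighbors[var] if u not in assignment
                   and val in domains[u]
                   and consistent(u, val, assignment, neighbors))
    for val in sorted(domains[var], key=ruled_out):
        if consistent(var, val, assignment, neighbors):
            result = backtrack({**assignment, var: val}, domains, neighbors)
            if result is not None:
                return result
    return None                                             # no value works: backtrack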
Example 10.4.9.
Suggested Reading:
• Chapter 6: Constraint Satisfaction Problems, Sections 6.1 and 6.3, in [RN09].
– Compared to our treatment of the topic “Constraint Satisfaction Problems” (chapter 10 and
chapter 11), RN covers much more material, but less formally and in much less detail (in par-
ticular, my slides contain many additional in-depth examples). Nice background/additional
reading, can’t replace the lecture.
– Section 6.1: Similar to my “Introduction” and “Constraint Networks”, less/different examples,
much less detail, more discussion of extensions/variations.
– Section 6.3: Similar to my “Naïve Backtracking” and “Variable- and Value Ordering”, with
less examples and details; contains part of what I cover in chapter 11 (RN does inference first,
then backtracking). Additional discussion of backjumping.
Chapter 11
Constraint Propagation
In this chapter we discuss another idea that is central to symbolic AI as a whole. The first component is that with factored state representations, we need to use a representation language for (sets of) states. The second component is that instead of state-level search, we can graduate to representation-level search (inference), which can be much more efficient than state-level search, as the respective representation-language actions correspond to groups of state-level actions.
11.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/22321.
Illustration: Inference
Example 11.1.1.
A constraint network γ:
[Figure: the Australia map-coloring network (cf. Figure 6.1 in [RN09]).]
Example 11.1.2. C_WAQ := “=”. If WA and Q are assigned different colors, then NT must be assigned [. . . ]
Illustration: Decomposition
Example 11.1.3. A constraint network γ:
[Figure: the Australia map-coloring network, as above.]
11.2 Inference
A Video Nugget covering this section can be found at https://fau.tv/clip/id/22326.
Example 11.2.2. It’s what you do all the time when playing SuDoKu:
Example 11.2.4.
[Figure: two pairs of constraint networks γ and γ′ (domains {red, blue}, inequality constraints).]
Tightness
Definition 11.2.5 (Tightness). Let γ:=⟨V , D, C ⟩ and γ ′ = ⟨V γ ′ , Dγ ′ , C γ ′ ⟩ be
constraint networks sharing the same set of variables, then γ ′ is tighter than γ,
(write γ ′ ⊑γ), if:
(i) For all v∈V : Dv ⊆ Dv .
(ii) For all u ̸= v ∈ V and C uv ∈C γ ′ : either C uv ̸∈C or C uv ⊆ C uv .
γ ′ is strictly tighter than γ, (written γ ′ <γ), if at least one of these inclusions is
proper.
Example 11.2.6.
[Figure: four pairs of constraint networks γ and γ′ (domains {red, blue}, inequality constraints), illustrating (strict) tightness.]
Idea: Encode variable assignments as unary constraints (i.e., for a(v) = d, set the
unary constraint Dv = {d}), so that inference reasons about the network restricted
to the commitments already made.
[Table: domains of WA, NT, Q, NSW, V, SA, T during backtracking with forward checking.]
Note: It is a bit strange that we start with d′ here; this is to make the link to arc consistency – coming up next – as obvious as possible (same notation: u and d vs. v and d′).
Practical Properties:
Forward Checking
[Figure: forward checking illustrated on the network v1 < v2, v2 < v3 over domains {1, 2, 3}.]
When Forward Checking is Not Good Enough
Example 11.4.2.
[Table: domains of WA, NT, Q, NSW, V, SA, T during the search.]
Lemma 11.4.8. If d is the maximal domain size in γ and the test “(d, d′) ∈ Cuv?” has running time O(1), then the running time of Revise(γ, u, v) is O(d^2).
Example 11.4.9. Revise(γ, v3, v2)
[Figure: the network v1 < v2, v2 < v3; values of v3 without support in v2 are removed.]
Proof sketch: O(md^2) for each inner loop; a fixed point is reached at the latest once all nd variable values have been removed.
Problem: There are redundant computations.
Question: Do you see what these redundant computations are?
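A compact sketch of Revise and AC-3 in Python (illustrative; the representation of constraints as a dictionary of ordered pairs mapped to predicates is an assumption of this sketch, and both directions (u, v) and (v, u) of every binary constraint must be present):

from collections import deque

def revise(domains, constraints, u, v):
    """Remove values of u that have no supporting value in v; O(d^2) constraint tests."""
    test = constraints[(u, v)]
    supported = {d for d in domains[u] if any(test(d, e) for e in domains[v])}
    changed = supported != domains[u]
    domains[u] = supported
    return changed

def ac3(domains, constraints):
    queue = deque(constraints)                       # all ordered constraint pairs
    while queue:
        u, v = queue.popleft()
        if revise(domains, constraints, u, v):
            for (w, x) in constraints:               # D_u changed: re-check every
                if x == u and w != v:                # constraint pointing into u
                    queue.append((w, x))
    return domains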
AC-3: Example
Example 11.4.13. y div x = 0: y modulo x is 0, i.e., y is divisible by x
[Figure: AC-3 run on the network with constraints v2 div v1 = 0 and v3 div v1 = 0, showing the worklist M (pairs such as (v2, v1), (v1, v2), (v3, v1), (v1, v3)) and the domains shrinking after each Revise call.]
AC-3: Runtime
Problem structure
[Figure: the Australia constraint graph; Tasmania is an independent subproblem (cf. Figure 6.1 in [RN09]).]
E.g., n = 80, d = 2, c = 20:
2^80 ≈ 10^24 nodes: 4 billion years at 10 million nodes/sec.
4 · 2^20 ≈ 4 million nodes: 0.4 seconds at 10 million nodes/sec.
Example 11.5.4 (Doing the Numbers).
γ with n = 40 variables, each of domain size k = 2; four separate connected components, each of size 10.
Reduction of the worst case when using decomposition:
Tree-structured CSPs
Theorem 11.5.5. If the constraint graph has no cycles, the CSP can be solved in O(n·d^2) time.
Compare to general CSPs, where the worst-case time is O(d^n).
This property also applies to logical and probabilistic reasoning: an important example of the relation between syntactic restrictions and the complexity of reasoning.
Definition 11.5.8. Cutset conditioning: instantiate (in all ways) a set of variables such that the remaining constraint graph is a tree.
Cutset size c: running time O(d^c · (n − c)·d^2), very fast for small c.
Constraint networks with acyclic constraint graphs can be solved in (low order) polynomial time (PTIME).
Example 11.5.10. Australia is not acyclic. (But see next section)
[Figure: the Australia constraint graph (cf. Figure 6.1 in [RN09]).]
γ with n = 40 variables, each of domain size k = 2; acyclic constraint graph.
Reduction of the worst case when using decomposition:
No decomposition: 2^40. With decomposition: 40 · 2^2. Gain: 2^32.
a We assume here that γ’s constraint graph is connected. If it is not, do this and the following
AcyclicCG(γ): Example
Example 11.5.14 (AcyclicCG() execution).
[Figure: AcyclicCG() run on the network v1 < v2, v2 < v3 over domains {1, 2, 3}.]
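A sketch of the AcyclicCG idea in Python (illustrative; the interface order, parent, and ok is an assumption made for this sketch): make each parent arc consistent with its children bottom-up, then assign values top-down.

def solve_tree_csp(order, parent, domains, ok):
    # `order` lists the variables so that every variable's parent precedes it;
    # `parent` maps each variable to its parent (None for the root);
    # `ok(u, du, v, dv)` tests the binary constraint between u and v.
    # bottom-up pass: make each parent arc consistent with its child
    for v in reversed(order[1:]):
        p = parent[v]
        domains[p] = {dp for dp in domains[p]
                      if any(ok(p, dp, v, dv) for dv in domains[v])}
        if not domains[p]:
            return None                              # some domain ran empty: inconsistent
    # top-down pass: pick any value consistent with the parent's chosen value
    assignment = {}
    for v in order:
        p = parent[v]
        for dv in domains[v]:
            if p is None or ok(p, assignment[p], v, dv):
                assignment[v] = dv
                break
        else:
            return None          # does not happen after a successful bottom-up pass
    return assignment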
To apply to CSPs: allow states with unsatisfied constraints, actions reassign variable
values.
Variable selection: randomly select any conflicted variable.
Value selection: by the min conflicts heuristic: choose the value that violates the fewest constraints, i.e., hill climbing with h(n) := total number of violated constraints.
Example: 4-Queens
States: 4 queens in 4 columns (4^4 = 256 states)
Actions: move queen in column
Performance of min-conflicts
Given random initial state, can solve n-queens in almost constant time for arbitrary
n with high probability (e.g., n = 10,000,000)
The same appears to be true for any randomly-generated CSP except in a narrow range of the ratio
R = (number of constraints) / (number of variables)
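A minimal Python sketch of min-conflicts for n-queens (illustrative only; queens[c] is the row of the queen in column c, and the step bound is an arbitrary choice for this sketch):

import random

def conflicts(queens, col, row):
    """Number of queens attacking a queen placed at (col, row)."""
    return sum(1 for c, r in enumerate(queens)
               if c != col and (r == row or abs(r - row) == abs(c - col)))

def min_conflicts(n, max_steps=100_000):
    queens = [random.randrange(n) for _ in range(n)]         # random initial state
    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(queens, c, queens[c]) > 0]
        if not conflicted:
            return queens                                     # no violated constraints
        col = random.choice(conflicted)                       # random conflicted variable
        queens[col] = min(range(n),                           # value with fewest conflicts
                          key=lambda r: conflicts(queens, col, r))
    return None

# print(min_conflicts(8))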
Arc consistency removes values that do not comply with any value still available at
the other end of a constraint. This subsumes forward checking.
The constraint graph captures the dependencies between variables. Separate con-
nected components can be solved independently. Networks with acyclic constraint
graphs can be solved in low order polynomial time.
A cutset is a subset of variables whose removal renders the constraint graph acyclic.
Cutset decomposition backtracks only on such a cutset, and solves a sub-problem
with acyclic constraint graph at each search leaf.
Suggested Reading:
• Chapter 6: Constraint Satisfaction Problems in [RN09], in particular Sections 6.2, 6.3.2, and
6.5.
– Compared to our treatment of the topic “Constraint Satisfaction Problems” (chapter 10 and
chapter 11), RN covers much more material, but less formally and in much less detail (in par-
ticular, my slides contain many additional in-depth examples). Nice background/additional
reading, can’t replace the lecture.
– Section 6.3.2: Somewhat comparable to my “Inference” (except that equivalence and tightness
are not made explicit in RN) together with “Forward Checking”.
– Section 6.2: Similar to my “Arc Consistency”, less/different examples, much less detail, addi-
tional discussion of path consistency and global constraints.
– Section 6.5: Similar to my “Decomposition: Constraint Graphs, and Three Simple Cases” and
“Cutset Conditioning”, less/different examples, much less detail, additional discussion of tree
decomposition.
Part III
12.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/22455.
Definition 12.1.2 (Actions). The agent can perform the following actions: goForward, turnRight (by 90°), turnLeft (by 90°), shoot the arrow in the direction you are facing (you have exactly one arrow), grab an object in the current cell, leave the cave if you are in cell [1, 1].
Definition 12.1.3 (Initial and Terminal States). Initially, the agent is in cell [1, 1] facing east. If the agent falls down a pit or meets a live Wumpus, it dies.
Definition 12.1.4 (Percepts). The agent can experience the following percepts:
stench, breeze, glitter, bump, scream, none.
Cell adjacent (i.e. north, south, west, east) to Wumpus: stench (else: none).
Cell adjacent to pit: breeze (else: none).
Cell that contains gold: glitter (else: none).
You walk into a wall: bump (else: none).
Wumpus shot by arrow: scream (else: none).
(1) Initial state (2) One step to right (3) Back, and up to [1,2]
[Figure 2.12 in [RN09]: a model-based reflex agent — it keeps track of the current state of the world using an internal model, and then chooses an action in the same way as the reflex agent.]
Idea: “Thinking” = Reasoning about knowledge represented using logic.
chapter 15: The Davis Putnam procedure and clause learning; practical problem
structure.
State-of-the-art algorithms for reasoning about propositional logic, and an im-
portant observation about how they behave.
Definition 12.2.1 (Syntax). The formulae of propositional logic (write PL0 ) are
made up from
propositional variables: V0 :={P , Q, R, P 1 , P 2 , . . .} (countably infinite)
constants/constructors called connectives: Σ0 :={T , F , ¬, ∨, ∧, ⇒, ⇔, . . .}
Definition 12.2.6. The value function I φ : wff0 (V0 )→Do assigns values to PL0
formulae. It is recursively defined,
I φ (P ) = φ(P ) (base case)
I φ (¬A) = I(¬)(I φ (A)).
I φ (A ∧ B) = I(∧)(I φ (A), I φ (B)).
Computing Semantics
Example 12.2.8. Let φ:=[T/P 1 ], [F/P 2 ], [T/P 3 ], [F/P 4 ], . . . then
I φ (P 1 ∨ P 2 ∨ ¬(¬P 1 ∧ P 2 ) ∨ P 3 ∧ P 4 )
= I(∨)(I φ (P 1 ∨ P 2 ), I φ (¬(¬P 1 ∧ P 2 ) ∨ P 3 ∧ P 4 ))
= I(∨)(I(∨)(I φ (P 1 ), I φ (P 2 )), I(∨)(I φ (¬(¬P 1 ∧ P 2 )), I φ (P 3 ∧ P 4 )))
= I(∨)(I(∨)(φ(P 1 ), φ(P 2 )), I(∨)(I(¬)(I φ (¬P 1 ∧ P 2 )), I(∧)(I φ (P 3 ), I φ (P 4 ))))
= I(∨)(I(∨)(T, F), I(∨)(I(¬)(I(∧)(I φ (¬P 1 ), I φ (P 2 ))), I(∧)(φ(P 3 ), φ(P 4 ))))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(I φ (P 1 )), φ(P 2 ))), I(∧)(T, F)))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(φ(P 1 )), F)), F))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(T), F)), F))
= I(∨)(T, I(∨)(I(¬)(I(∧)(F, F)), F))
= I(∨)(T, I(∨)(I(¬)(F), F))
= I(∨)(T, I(∨)(T, F))
= I(∨)(T, T)
= T
What a mess!
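The same computation can be done mechanically; here is a small Python sketch of the value function on formulae represented as nested tuples (the representation is an assumption made for this sketch, not a notation from the notes):

def value(formula, phi):
    """I_phi: evaluate a formula under the variable assignment phi (dict var -> bool)."""
    op = formula[0]
    if op == "var":
        return phi[formula[1]]                       # base case: I_phi(P) = phi(P)
    if op == "not":
        return not value(formula[1], phi)
    if op == "and":
        return value(formula[1], phi) and value(formula[2], phi)
    if op == "or":
        return value(formula[1], phi) or value(formula[2], phi)
    raise ValueError("unknown connective: " + op)

# the formula and assignment from Example 12.2.8:
phi = {"P1": True, "P2": False, "P3": True, "P4": False}
P1, P2, P3, P4 = (("var", p) for p in ("P1", "P2", "P3", "P4"))
formula = ("or", ("or", P1, P2),
                 ("or", ("not", ("and", ("not", P1), P2)),
                        ("and", P3, P4)))
print(value(formula, phi))                            # True, as computed above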
Now we will also review some propositional identities that will be useful later on. Some of them we
have already seen, and some are new. All of them can be proven by simple truth table arguments.
Propositional Identities
We have the following identities in propositional logic:
We will now use the distribution of values of a Boolean expression under all (variable) assignments
to characterize them semantically. The intuition here is that we want to understand theorems,
examples, counterexamples, and inconsistencies in mathematics and everyday reasoning1 .
The idea is to use the formal language of Boolean expressions as a model for mathematical
language. Of course, we cannot express all of mathematics as Boolean expressions, but we can at
least study the interplay of mathematical statements (which can be true or false) with the copula
“and”, “or” and “not”.
Let us now see how these semantic properties model mathematical practice.
In mathematics we are interested in assertions that are true in all circumstances. In our model
of mathematics, we use variable assignments to stand for circumstances. So we are interested
in Boolean expressions which are true under all variable assignments; we call them valid. We
often give examples (or show situations) which make a conjectured assertion false; we call such
examples counterexamples, and such assertions “falsifiable”. We also often give examples for certain
assertions to show that they can indeed be made true (which is not the same as being valid
yet); such assertions we call “satisfiable”. Finally, if an assertion cannot be made true in any
circumstances we call it “unsatisfiable”; such assertions naturally arise in mathematical practice in
the form of refutation proofs, where we show that an assertion (usually the negation of the theorem
we want to prove) leads to an obviously unsatisfiable conclusion, showing that the negation of the
theorem is unsatisfiable, and thus the theorem valid.
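These four notions can be made executable by checking a formula under all variable assignments; the following self-contained Python sketch (same tuple representation as in the evaluator sketch above, chosen only for illustration) does exactly that:

from itertools import product

def value(f, phi):
    op = f[0]
    if op == "var": return phi[f[1]]
    if op == "not": return not value(f[1], phi)
    if op == "and": return value(f[1], phi) and value(f[2], phi)
    if op == "or":  return value(f[1], phi) or value(f[2], phi)
    raise ValueError(op)

def variables(f):
    return {f[1]} if f[0] == "var" else set().union(*(variables(s) for s in f[1:]))

def truth_values(f):
    vs = sorted(variables(f))
    return [value(f, dict(zip(vs, bits)))
            for bits in product([True, False], repeat=len(vs))]

def valid(f):         return all(truth_values(f))
def satisfiable(f):   return any(truth_values(f))
def falsifiable(f):   return not valid(f)
def unsatisfiable(f): return not satisfiable(f)

# e.g. P ∨ ¬P is valid, P ∧ ¬P is unsatisfiable:
P = ("var", "P")
print(valid(("or", P, ("not", P))), unsatisfiable(("and", P, ("not", P))))   # True True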
1 Here (and elsewhere) we will use mathematics (and the language of mathematics) as a test tube for under-
standing reasoning, since mathematics has a long history of studying its own reasoning processes and assumptions.
3. 1. together with 2.1 entails that ai(x) ⇒ bla(x) for every x∈{S, N , J},
4. thus ¬bla(S) ∧ ¬bla(J) by 3. and 2.2 and
5. so ¬ai(S) ∧ ¬ai(J) by 3. and 4.
6. With 2.3 the latter entails ai(N ).
In the hair-color example we have seen that we are able to model complex situations in PL0 .
The trick of using variables with fancy names like bla(N ) is a bit dubious, and we can already
imagine that it will be difficult to support programmatically unless we make names like bla(N) into first-class citizens, i.e. expressions of the logic language themselves.
Idea: Re-use PL0 , but replace propositional variables with something more expres-
sive! (instead of fancy variable name
trick)
Definition 12.3.1. A first-order signature consists of pairwise disjoint, countable sets for each k∈N:
function symbols: f^k ∈ Σf_k
predicate symbols: p^k ∈ Σp_k
Definition 12.3.2. The formulae of PLnq are given by the following grammar:
terms      t ::= X                     variable
              |  f^0                   constant
              |  f^k(t1, . . ., tk)    application
formulae   A ::= p^k(t1, . . ., tk)    atomic
              |  ¬A                    negation
              |  A1 ∧ A2               conjunction
PLNQ Semantics
Definition 12.3.3. Universes Do = {T, F} of truth values and Dι ̸= ∅ of individu-
als.
Definition 12.3.5.
Corollary 12.3.12. PLnq is isomorphic to PL0 , i.e. the following diagram commutes:
[Diagram: PLnq(Σ1) is translated to PL0(AΣ) via θΣ; on the semantic side, models ψ = ⟨Dψ, Iψ⟩ are mapped to variable assignments VΣ → {T, F} via ψ ↦ Mψ, and the value functions Iψ(·) and IφMψ(·) agree.]
3.2. If A = ¬B, then Iψ(A) = T, iff Iψ(B) = F, iff IφMψ(B) = F, iff IφMψ(A) = T.
3.3. If A = B ∧ C, then we argue similarly.
4. Hence Iψ(A) = IφMψ(A) for all PLnq formulae, and we have concluded the proof.
We have now defined syntax (the language agents can use to represent knowledge) and its semantics (how expressions of this language relate to the agent’s environment). Theoretically, an agent could use the entailment relation to derive new knowledge from percepts and the existing state representation – in the MAKE−PERCEPT−SENTENCE and MAKE−ACTION−SENTENCE subroutines below. But as we have seen above, this is very tedious. A much better way would be to have a set of rules that directly act on the state representations.
[Figure 2.12 in [RN09]: a model-based reflex agent (as above).]
Idea: “Thinking” = Reasoning about knowledge represented using logic.
A Simple Formal System: Prop. Logic with Hilbert-Calculus
Formulae: built from propositional variables: P, Q, R, . . . and implication: ⇒
Semantics: Iφ(P) = φ(P) and Iφ(A ⇒ B) = T, iff Iφ(A) = F or Iφ(B) = T.
Definition 12.4.1. The Hilbert calculus H0 consists of the inference rules:

  ------------ K      --------------------------------------- S
   P ⇒ Q ⇒ P          (P ⇒ Q ⇒ R) ⇒ (P ⇒ Q) ⇒ P ⇒ R

   A ⇒ B    A              A
  ------------ MP      ----------- Subst
       B                [B/X](A)
This is indeed a very simple formal system, but it has all the required parts:
• A formal language: expressions built up from variables and implications.
• A semantics: given by the obvious interpretation function
• A calculus: given by the two axioms and the two inference rules.
The calculus gives us a set of rules with which we can derive new formulae from old ones. The
axioms are very simple rules, they allow us to derive these two formulae in any situation. The
proper inference rules are slightly more complicated: we read the formulae above the horizontal
line as assumptions and the (single) formula below as the conclusion. An inference rule allows us
to derive the conclusion, if we have already derived the assumptions.
Now, we can use these inference rules to perform a proof – a sequence of formulae that can be
derived from each other. The representation of the proof in the slide is slightly compactified to fit
onto the slide: We will make it more explicit here. We first start out by deriving the formula
(P ⇒ Q ⇒ R) ⇒ (P ⇒ Q) ⇒ P ⇒ R (12.1)
which we can always do, since we have an axiom for this formula, then we apply the rule Subst,
where A is this result, B is C, and X is the variable P to obtain
(C ⇒ Q ⇒ R) ⇒ (C ⇒ Q) ⇒ C ⇒ R (12.2)
Next we apply the rule Subst to this where B is C ⇒ C and X is the variable Q this time to obtain
(C ⇒ (C ⇒ C) ⇒ R) ⇒ (C ⇒ C ⇒ C) ⇒ C ⇒ R (12.3)
And again, we apply the inference rule Subst; this time B is C and X is the variable R, yielding
the first formula in our proof on the slide. To conserve space, we have combined these three steps
into one in the slide. The next steps are done in exactly the same way.
In general formulae can be used to represent facts about the world as propositions; they have a
semantics that is a mapping of formulae into the real world (propositions are mapped to truth
values.) We have seen two relations on formulae: the entailment relation and the deduction
relation. The first one is defined purely in terms of the semantics, the second one is given by a
calculus, i.e. purely syntactically. Is there any relation between these relations?
Goal: Find calculi C, such that ⊢C A iff |=A (provability and validity coincide)
To TRUTH through PROOF (CALCULEMUS [Leibniz ∼1680])
Ideally, both relations would be the same, then the calculus would allow us to infer all facts
that can be represented in the given formal language and that are true in the real world, and only
those. In other words, our representation and inference is faithful to the world.
A consequence of this is that we can rely on purely syntactical means to make predictions
about the world. Computers rely on formal representations of the world; if we want to solve a
problem on our computer, we first represent it in the computer (as data structures, which can be
seen as a formal language) and do syntactic manipulations on these structures (a form of calculus).
Now, if the provability relation induced by the calculus and the validity relation coincide (this will
be quite difficult to establish in general), then the solutions of the program will be correct, and
we will find all possible ones.
Of course, the logics we have studied so far are very simple, and not able to express interesting
facts about the world, but we will study them as a simple example of the fundamental problem of
computer science: How do the formal representations correlate with the real world. Within the
world of logics, one can derive new propositions (the conclusions, here: Socrates is mortal) from given ones (the premises, here: Every human is mortal and Socrates is human). Such derivations
are proofs.
In particular, logics can describe the internal structure of real-life facts; e.g. individual things,
actions, properties. A famous example, which is in fact as old as it appears, is illustrated in the
slide below.
If a logic is correct, the conclusions one can prove are true (= hold in the real world) whenever
the premises are true. This is a miraculous fact (think about it!)
for propositional logic. The calculus was created in order to model the natural mode of reasoning
e.g. in everyday mathematical practice. In particular, it was intended as a counter-approach to
the well-known Hilbert style calculi, which were mainly used as theoretical devices for studying
reasoning in principle, not for modeling particular reasoning styles. We will introduce natural
deduction in two styles/notations, both invented by Gerhard Gentzen in the 1930s and very much related. The Natural Deduction style (ND) uses “local hypotheses” in proofs for hypothetical reasoning, while the “sequent style” is a rationalized version and extension of the ND calculus that makes certain meta-proofs simpler to push through by making the context of local hypotheses explicit in the notation. The sequent notation also constitutes a more adequate data structure for implementations and user interfaces.
Rather than using a minimal set of inference rules, we introduce a natural deduction calculus that
provides two/three inference rules for every logical constant, one “introduction rule” (an inference
rule that derives a formula with that symbol at the head) and one “elimination rule” (an inference
rule that acts on a formula with this head and derives a set of subformulae).
    [A]^1
     ...
      B                 A ⇒ B    A
  --------- ⇒I^1       ------------- ⇒E
   A ⇒ B                    B
The most characteristic rule in the natural deduction calculus is the ⇒I rule and the hypothetical reasoning it introduces. ⇒I corresponds to the mathematical way of proving an implication A⇒B:
We assume that A is true and show B from this local hypothesis. When we can do this we discharge
the assumption and conclude A ⇒ B.
Note that the local hypothesis is discharged by the rule ⇒I, i.e. it cannot be used in any other
part of the proof. As the ⇒I rules may be nested, we decorate both the rule and the corresponding
assumption with a marker (here the number 1).
Let us now consider an example of hypothetical reasoning in action.
  [A ∧ B]^1          [A ∧ B]^1
  ---------- ∧Er     ---------- ∧El
      B                  A
  ---------------------------- ∧I
            B ∧ A
  ---------------------------- ⇒I^1
        A ∧ B ⇒ B ∧ A

     [A]^1
    ------- ⇒I^2      (the local hypothesis [B]^2 is not used)
     B ⇒ A
  ------------ ⇒I^1
   A ⇒ B ⇒ A
Here we see hypothetical reasoning with local hypotheses at work. In the left example, we assume the formula A ∧ B and can use it in the proof until it is discharged by the rule ⇒I^1 at the bottom – therefore we decorate the hypothesis and the rule with corresponding numbers (here the label “1”). Note that the assumption A ∧ B is local to the proof fragment delineated by the corresponding local hypothesis and the discharging rule, i.e. even if this proof is only a fragment of a larger proof, then we cannot use its local hypothesis anywhere else.
Note also that we can use as many copies of the local hypothesis as we need; they are all
discharged at the same time.
In the right example we see that local hypotheses can be nested as long as they are kept local. In
particular, we may not use the hypothesis B after the ⇒I 2 , e.g. to continue with a ⇒E.
One of the nice things about the natural deduction calculus is that the deduction theorem is
almost trivial to prove. In a sense, the triviality of the deduction theorem is the central idea of
the calculus and the feature that makes it so natural.
Another characteristic of the natural deduction calculus is that it has inference rules (introduction
and elimination rules) for all connectives. So we extend the set of rules from Definition 12.5.1 for
disjunction, negation and falsity.
      A                  B             A ∨ B    [A]^1 . . . C    [B]^1 . . . C
  --------- ∨Il      --------- ∨Ir    -------------------------------------- ∨E^1
    A ∨ B              A ∨ B                             C

   [A]^1 . . . C    [A]^1 . . . ¬C           ¬¬A
  ---------------------------------- ¬I^1   ------ ¬E
                 ¬A                            A

    ¬A    A            F
  ----------- FI     ------ FE
       F               A
  -------------- Ax     -------------- Ax
  A ∧ B ⊢ A ∧ B         A ∧ B ⊢ A ∧ B
  -------------- ∧Er    -------------- ∧El
  A ∧ B ⊢ B             A ∧ B ⊢ A
  ------------------------------------ ∧I
            A ∧ B ⊢ B ∧ A
  ------------------------------------ ⇒I
            ⊢ A ∧ B ⇒ B ∧ A

  ----------- Ax
   A, B ⊢ A
  -------------- ⇒I
   A ⊢ B ⇒ A
  ----------------- ⇒I
   ⊢ A ⇒ B ⇒ A
Definition 12.5.9. The following inference rules make up the propositional sequent-style natural deduction calculus ND⊢0:

  ----------- Ax        Γ ⊢ B
   Γ, A ⊢ A          ----------- weaken      ---------------- TND
                       Γ, A ⊢ B               Γ ⊢ A ∨ ¬A

    Γ, A ⊢ F              Γ ⊢ ¬¬A
  ------------ ¬I       ----------- ¬E
    Γ ⊢ ¬A                 Γ ⊢ A
Each row in the table represents one inference step in the proof. It consists of a line number (for referencing), a formula for the asserted property, a justification via an ND rule (and the rows this one is derived from), and finally a list of row numbers of proof steps that are local hypotheses in effect for the current row.
12.6 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25027.
Summary
Sometimes, it pays off to think before acting.
Propositional logic formulas are built from atomic propositions, with the connectives
and, or, not.
Suggested Reading:
• Chapter 7: Logical Agents, Sections 7.1 – 7.5 [RN09].
– Sections 7.1 and 7.2 roughly correspond to my “Introduction”, Section 7.3 roughly corresponds
to my “Logic (in AI)”, Section 7.4 roughly corresponds to my “Propositional Logic”, Section
7.5 roughly corresponds to my “Resolution” and “Killing a Wumpus”.
– Overall, the content is quite similar. I have tried to add some additional clarifying illustrations.
RN gives many complementary explanations, nice as additional background reading.
– I would note that RN’s presentation of resolution seems a bit awkward, and Section 7.5 con-
tains some additional material that is imho not interesting (alternate inference rules, forward
and backward chaining). Horn clauses and unit resolution (also in Section 7.5), on the other
hand, are quite relevant.
Chapter 13
Recall: Our knowledge of the cave entails a definite Wumpus position!(slide 306)
Problem: That was human reasoning, can we build an agent function that does
this?
Answer:
As for constraint networks, we use inference, here resolution/tableaux.
Unsatisfiability Theorem
Theorem 13.0.1 (Unsatisfiability Theorem). H |= A iff H ∪ {¬A} is unsatisfi-
able.
Proof: We prove both directions separately
1. “⇒”: Say H |= A
1.1. For any φ with φ|=H we have φ|=A and thus φ̸|=¬A.
2. “⇐”: Say H ∪ {¬A} is unsatisfiable.
2.1. For any φ with φ|=H we have φ̸|=¬A and thus φ|=A.
Observation 13.0.2. Entailment can be tested via satisfiability.
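A small worked instance of Theorem 13.0.1 (a standard example, not taken from the notes): take H := {P, P ⇒ Q} and A := Q. Then H |= Q, and correspondingly H ∪ {¬Q} = {P, P ⇒ Q, ¬Q} is unsatisfiable: any φ satisfying P and P ⇒ Q must satisfy Q, which contradicts ¬Q. A theorem prover can therefore establish H |= Q by refuting H ∪ {¬Q}.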
Definition 13.0.3. Given a formal system ⟨L, K, |=, C ⟩, the task of theorem proving
consists in determining whether H⊢C C for a conjecture C∈L and hypotheses H ⊆
L.
Definition 13.0.4. Given a logical system L:=⟨L, K, |=⟩, the task of automated
theorem proving (ATP) consists of developing calculi for L and programs – called
(automated) theorem provers – that given a set H ⊆ L of hypotheses and a conjec-
ture A∈L determine whether H |= A (usually by searching for C-derivations H⊢C A
in a calculus C).
Idea: ATP with a calculus C for ⟨L, K, |=⟩ induces a search problem Π := ⟨S, A, T, I, G⟩, where the states S are sets of formulae in L, the actions A are the inference rules from C, the initial state is I = H, and the goal states are those S∈S with A∈S.
Problem: ATP as a search problem does not admit good heuristics, since these
need to take the conjecture A into account.
Idea: Turn the search around – using the unsatisfiability theorem (Theorem 13.0.1).
Observation: A test calculus C induces a search problem where the initial state
is H ∪ {¬A} and S∈S is a goal state iff ⊥∈S.(proximity of ⊥ easier for heuristics)
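To make the refutation idea concrete, here is a minimal Python sketch (my illustration, not part of the course materials): it decides H |= A by checking that H ∪ {¬A} is unsatisfiable, with brute-force enumeration of variable assignments standing in for a proper test calculus. The tuple encoding of formulae is an ad-hoc choice for this sketch.

from itertools import product

# Formulae as nested tuples: ("var", "P"), ("not", A), ("and", A, B), ("or", A, B).
def atoms(f):
    if f[0] == "var":
        return {f[1]}
    return set().union(*(atoms(g) for g in f[1:]))

def value(f, phi):
    op = f[0]
    if op == "var":  return phi[f[1]]
    if op == "not":  return not value(f[1], phi)
    if op == "and":  return value(f[1], phi) and value(f[2], phi)
    if op == "or":   return value(f[1], phi) or value(f[2], phi)
    raise ValueError(op)

def unsatisfiable(formulas):
    """True iff no variable assignment satisfies all formulas at once."""
    vs = sorted(set().union(*(atoms(f) for f in formulas)))
    return not any(all(value(f, dict(zip(vs, bits))) for f in formulas)
                   for bits in product([True, False], repeat=len(vs)))

def entails(H, A):
    """H |= A iff H together with the negation of A is unsatisfiable."""
    return unsatisfiable(list(H) + [("not", A)])

# {P, ~P v Q} |= Q:
P, Q = ("var", "P"), ("var", "Q")
print(entails([P, ("or", ("not", P), Q)], Q))   # True

The test calculi discussed below (tableaux and resolution) replace the exponential enumeration in unsatisfiable by syntactic derivations that search for ⊥ or the empty clause.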
The idea about literals is that they are atoms (the simplest formulae) that carry around their
intended truth value.
13.2. ANALYTICAL TABLEAUX 219
Normal Forms
There are two quintessential normal forms for propositional formulae: (there are others as well)
Definition 13.1.6. A formula is in conjunctive normal form (CNF) if it is a conjunction of disjunctions of literals, i.e. if it is of the form ⋀_{i=1}^{n} ⋁_{j=1}^{m_i} l_{i,j}.
Definition 13.1.7. A formula is in disjunctive normal form (DNF) if it is a disjunction of conjunctions of literals, i.e. if it is of the form ⋁_{i=1}^{n} ⋀_{j=1}^{m_i} l_{i,j}.
Observation 13.1.8. Every formula has equivalent formulae in CNF and DNF.
Tableau calculi develop a formula in a tree-shaped arrangement that represents a case analysis on
when a formula can be made true (or false). Therefore the formulae are decorated with exponents
that hold the intended truth value.
On the left we have a refutation tableau that analyzes a negated formula (it is decorated with the
intended truth value F). Both branches contain an elementary contradiction ⊥.
On the right we have a model generation tableau, which analyzes a positive formula (it is decorated with the intended truth value T). This tableau uses the same rules as the refutation tableau, but makes a case analysis of when this formula can be satisfied. In this case we have a closed branch and an open one; the open branch corresponds to a model.
Now that we have seen the examples, we can write down the tableau rules formally.
The propositional tableau rules:
T0∧: from (A ∧ B)^T derive A^T and B^T (on the same branch)
T0∨: from (A ∧ B)^F split the branch into A^F and B^F
T0¬T: from (¬A)^T derive A^F        T0¬F: from (¬A)^F derive A^T
T0⊥: from A^α and A^β with α ≠ β derive ⊥
Definition 13.2.4. Call a tableau saturated, iff no rule application adds new material to it, and call a branch closed, iff it ends in ⊥, else open. A tableau is closed, iff all of its branches are.
These inference rules, which act on tableaux, have to be read as follows: if the formulae above the line appear on a tableau branch, then the branch can be extended by the formulae or branches below the line. There are two rules for each primary connective, and a branch closing rule that adds the special symbol ⊥ (for unsatisfiability) to a branch.
We use the tableau rules with the convention that they are only applied, if they contribute new
material to the branch. This ensures termination of the tableau procedure for propositional logic
(every rule eliminates one primary connective).
Definition 13.2.5. We will call a closed tableau with the labeled formula Aα at the root a
tableau refutation for Aα .
The saturated tableau represents a full case analysis of what is necessary to give A the truth value
α; since all branches are closed (contain contradictions) this is impossible.
Definition 13.2.7. We will call a tableau refutation for AF a tableau proof for A, since it refutes
the possibility of finding a model where A evaluates to F. Thus A must evaluate to T in all
models, which is just our definition of validity.
Thus the tableau procedure can be used as a calculus for propositional logic. In contrast to the
propositional Hilbert calculus it does not prove a theorem A by deriving it from a set of axioms,
but it proves it by refuting its negation. Such calculi are called negative or test calculi. Generally
negative calculi have computational advantages over positive ones, since they have a built-in sense
of direction.
We have rules for all the necessary connectives (we restrict ourselves to ∧ and ¬, since the others
can be expressed in terms of these two via the propositional identities above. For instance, we can
write A ∨ B as ¬(¬A ∧ ¬B), and A ⇒ B as ¬A ∨ B,. . . .)
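The T0 rules can be turned directly into a small proof procedure. The following Python sketch is my own illustration (not the formalism of the notes): it works on labeled formulae over the ∧/¬ fragment, encoded as nested tuples, and tableau_proof(A) tests validity of A by refuting A^F.

# Formulae: ("var", name), ("not", A), ("and", A, B); labels are "T" or "F".
def closed(branch, todo):
    """Expand the labeled formulae in todo on this branch; True iff every resulting branch closes."""
    if not todo:                                   # saturated: closed iff it contains A^T and A^F
        return any((f, "T") in branch and (f, "F") in branch for (f, _) in branch)
    (f, a), rest = todo[0], todo[1:]
    branch = branch + [(f, a)]
    if (f, "F" if a == "T" else "T") in branch:    # branch closing rule
        return True
    if f[0] == "var":
        return closed(branch, rest)
    if f[0] == "not":                              # T0 negation rules: flip the label
        return closed(branch, rest + [(f[1], "F" if a == "T" else "T")])
    if f[0] == "and" and a == "T":                 # T0 conjunction-true: both conjuncts
        return closed(branch, rest + [(f[1], "T"), (f[2], "T")])
    if f[0] == "and" and a == "F":                 # T0 conjunction-false: split the branch
        return (closed(branch, rest + [(f[1], "F")]) and
                closed(branch, rest + [(f[2], "F")]))
    raise ValueError(f[0])

def tableau_proof(A):
    """A is valid iff the tableau with root A^F closes."""
    return closed([], [(A, "F")])

P = ("var", "P")
print(tableau_proof(("not", ("and", P, ("not", P)))))   # True: ¬(P ∧ ¬P) is valid

The convention that rules are only applied when they contribute new material is mirrored here by the fact that each step consumes one labeled formula and only adds strictly smaller ones, so the recursion terminates.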
We now look at a formulation of propositional logic with fancy variable names. Note that
loves(mary, bill) is just a variable name like P or X, which we have used earlier.
Example 13.2.8. The formula "If Mary loves Bill and John loves Mary, then John loves Mary" is valid; the tableau for its negation closes:
(loves(mary, bill) ∧ loves(john, mary) ⇒ loves(john, mary))^F
¬(¬¬(loves(mary, bill) ∧ loves(john, mary)) ∧ ¬loves(john, mary))^F
(¬¬(loves(mary, bill) ∧ loves(john, mary)) ∧ ¬loves(john, mary))^T
¬¬(loves(mary, bill) ∧ loves(john, mary))^T
¬(loves(mary, bill) ∧ loves(john, mary))^F
(loves(mary, bill) ∧ loves(john, mary))^T
¬loves(john, mary)^T
loves(mary, bill)^T
loves(john, mary)^T
loves(john, mary)^F
⊥
We could have used the unsatisfiability theorem (Theorem 13.0.1) here to show that If Mary loves
Bill and John loves Mary entails John loves Mary. But there is a better way to show entailment:
we directly use derivability in T0 .
Deriving Entailment in T0
Example 13.2.9. Mary loves Bill and John loves Mary together entail that John
loves Mary
loves(mary, bill)^T
loves(john, mary)^T
loves(john, mary)^F
⊥
This is a closed tableau, so {loves(mary, bill), loves(john, mary)}⊢T0 loves(john, mary).
Again, as T0 is sound and complete, we have {loves(mary, bill), loves(john, mary)} |= loves(john, mary).
Note: We can also use the tableau calculus to try and show entailment (and fail). The nice thing is that from the failed proof we can see what went wrong.
Obviously, the tableau above is saturated, but not closed, so it is not a tableau proof for our initial entailment conjecture. We have marked the literals on the open branch in green, since they allow us to read off the conditions of the situation in which the entailment fails to hold. As we intuitively argued above, this is the situation where Mary loves Bill. In particular, the open branch gives us a variable assignment (marked in green) that satisfies the initial formula. In this case, Mary loves Bill, which is a situation where the entailment fails.
Again, the derivability version is much simpler:
We have seen in the examples above that while it is possible to get by with only the connectives ∧ and ¬, it is a bit unnatural and tedious, since we need to eliminate the other connectives first. In this chapter, we will make the calculus less frugal by adding rules for the other connectives, without losing the advantage of dealing with a small calculus, which is good for making statements about the calculus itself.
We will convince ourselves that the first rule is derivable, and leave the other ones as an exercise.
The derived rules for the defined connectives:
(A ⇒ B)^T: split the branch into A^F and B^T        (A ⇒ B)^F: derive A^T and B^F
(A ∨ B)^T: split the branch into A^T and B^T        (A ∨ B)^F: derive A^F and B^F
(A ⇔ B)^T: split into A^T, B^T and A^F, B^F         (A ⇔ B)^F: split into A^T, B^F and A^F, B^T
The first of them is justified by the following derivation, which expands A ⇒ B via its definition ¬(¬¬A ∧ ¬B) and applies the T0 rules:
A^T
(A ⇒ B)^T
(¬A ∨ B)^T
¬(¬¬A ∧ ¬B)^T
(¬¬A ∧ ¬B)^F
¬¬A^F  |  ¬B^F
¬A^T   |  B^T
A^F    |
⊥      |
Since the left alternative closes against A^T, any branch containing A^T and (A ⇒ B)^T can be extended by B^T.
With these derived rules, theorem proving becomes quite efficient. Using them, the tableau from Example 13.2.8 would have the following, much simpler form:
As always we need to convince ourselves that the calculus is sound, otherwise, tableau proofs
do not guarantee validity, which we are after. Since we are now in a refutation setting we cannot
just show that the inference rules preserve validity: we care about unsatisfiability (which is the
dual notion to validity), as we want to show the initial labeled formula to be unsatisfiable. Before
we can do this, we have to ask ourselves, what it means to be (un)-satisfiable for a labeled formula
or a tableau.
Soundness (Tableau)
Idea: A test calculus is refutation sound, iff its inference rules preserve satisfiability
Thus we only have to prove Lemma 13.4.3; this is relatively easy to do. For instance for the first rule: if we have a tableau that contains (A ∧ B)^T and is satisfiable, then it must have a satisfiable branch. If (A ∧ B)^T is not on this branch, the tableau extension will not change satisfiability, so we can assume that it is on the satisfiable branch and thus I_φ(A ∧ B) = T for some variable assignment φ. Thus I_φ(A) = T and I_φ(B) = T, so after the extension (which adds the formulae A^T and B^T to the branch), the branch is still satisfiable. The cases for the other rules are similar.
The next result is a very important one, it shows that there is a procedure (the tableau procedure) that will always terminate and answer the question whether a given propositional formula is valid or not. This is very important, since other logics (like the often-studied first-order logic) do not enjoy this property.
Note: The proof above only works for the “base T0” because (only) there the rules do not “copy”. A rule like
(A ⇔ B)^T: split into A^T, B^T and A^F, B^F
does, and in particular the number of formulae that still have to be worked off below the line is larger than above the line. For such rules we would need a more intricate version of µ which – instead of returning a natural number – returns a more complex object; a multiset of numbers would work here. In our proof we are just assuming that the defined connectives have already been eliminated.
The tableau calculus basically computes the disjunctive normal form: every branch is a disjunct that is a conjunction of literals. The method relies on the fact that a DNF is unsatisfiable, iff each of its disjuncts is, i.e. iff each branch contains a contradiction in form of a pair of opposite literals.
We write 2 for the “empty” disjunction (no disjuncts) and call it the empty clause. A clause with exactly one literal is called a unit clause.
Definition 13.5.2 (Resolution Calculus). The resolution calculus R0 operates on clause sets via a single inference rule:
R: from P^T ∨ A and P^F ∨ B derive A ∨ B
This rule allows us to add the resolvent (the clause below the line) to a clause set which contains the two clauses above it. The literals P^T and P^F are called cut literals.
Definition 13.5.3 (Resolution Refutation). Let S be a clause set, then we call an R0-derivation of 2 from S an R0-refutation and write D : S⊢R0 2.
Definition 13.5.4.
We will often write a clause set {C 1 , . . ., C n } as C 1 ; . . . ; C n , use S ; T for the
union of the clause sets S and T , and S ; C for the extension by a clause C.
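As an illustration of R0 (my own; the encoding of clauses as frozensets of signed literals is an implementation choice, not notation from the notes), here is the resolution rule plus a naive saturation loop that searches for the empty clause 2:

from itertools import combinations

# A clause is a frozenset of literals; a literal is a pair (proposition, "T" or "F").
def resolvents(c1, c2):
    """All clauses obtainable from c1, c2 by one application of the resolution rule."""
    out = set()
    for (p, a) in c1:
        b = "F" if a == "T" else "T"
        if (p, b) in c2:                           # (p, a) and (p, b) are the cut literals
            out.add((c1 - {(p, a)}) | (c2 - {(p, b)}))
    return out

def refutable(clauses):
    """Saturate under resolution; True iff the empty clause is derivable."""
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if not r:
                    return True                    # derived the empty clause 2
                if r not in clauses:
                    new.add(r)
        if not new:
            return False                           # saturated without deriving 2
        clauses |= new

# The clause set from Example 13.5.10 below:
S = [frozenset({("P", "F"), ("Q", "F"), ("R", "T")}),
     frozenset({("P", "F"), ("Q", "T")}),
     frozenset({("P", "T")}),
     frozenset({("R", "F")})]
print(refutable(S))   # True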
Definition 13.5.5 (Transformation into Clause Normal Form). The CNF trans-
formation calculus CNF0 consists of the following four inference rules on sets of
labeled formulae.
from C ∨ (A ∨ B)^T derive C ∨ A^T ∨ B^T        from C ∨ (A ∨ B)^F derive C ∨ A^F and C ∨ B^F
from C ∨ (¬A)^T derive C ∨ A^F                 from C ∨ (¬A)^F derive C ∨ A^T
Definition 13.5.6. We write CNF0 (Aα ) for the set of all clauses derivable from
Aα via the rules above.
Note that the C-terms in the definition of the inference rules are necessary, since we assumed that the assumptions of the inference rules must match full clauses. The C-terms are used with the convention that they are optional, so that we can also simplify (A ∨ B)^T to A^T ∨ B^T.
Background: The background behind this notation is that A and T ∨ A are equivalent for any
A. That allows us to interpret the C-terms in the assumptions as T and thus leave them out.
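For comparison, here is a small Python sketch of a CNF computation (my own formulation, not literally the labeled-formula calculus CNF0: it first pushes negations inward and then distributes ∨ over ∧, which yields the same clause set for the ∧/∨/¬ fragment):

# Formulae: ("var", P), ("not", A), ("and", A, B), ("or", A, B).
def nnf(f, positive=True):
    """Push negations down to the atoms (negation normal form)."""
    op = f[0]
    if op == "var":
        return f if positive else ("not", f)
    if op == "not":
        return nnf(f[1], not positive)
    a, b = nnf(f[1], positive), nnf(f[2], positive)
    # under a negation, De Morgan swaps the connective
    return (op if positive else ("or" if op == "and" else "and"), a, b)

def clauses(f):
    """CNF of an NNF formula as a set of clauses (frozensets of signed literals)."""
    op = f[0]
    if op == "var":
        return {frozenset({(f[1], "T")})}
    if op == "not":                               # in NNF, the argument is atomic
        return {frozenset({(f[1][1], "F")})}
    if op == "and":
        return clauses(f[1]) | clauses(f[2])
    if op == "or":                                # distribute: union every pair of clauses
        return {c | d for c in clauses(f[1]) for d in clauses(f[2])}

P, Q = ("var", "P"), ("var", "Q")
print(clauses(nnf(("not", ("or", P, ("not", Q))))))   # two unit clauses: P^F and Q^T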
The clause normal form translation as we have formulated it here is quite frugal; we have left out rules for the connectives ∧, ⇒, and ⇔, relying on the fact that formulae containing these connectives can be translated into ones without them before CNF transformation. The advantage of
having a calculus with few inference rules is that we can prove meta properties like soundness and
completeness with less effort (these proofs usually require one case per inference rule). On the
other hand, adding specialized inference rules makes proofs shorter and more readable.
Fortunately, there is a way to have your cake and eat it. Derivable inference rules have the property
that they are formally redundant, since they do not change the expressive power of the calculus.
Therefore we can leave them out when proving meta-properties, but include them when actually
using the calculus.
Example 13.5.8. The derivation
C ∨ (A ⇒ B)^T  ⟶  C ∨ (¬A ∨ B)^T  ⟶  C ∨ ¬A^T ∨ B^T  ⟶  C ∨ A^F ∨ B^T
justifies the derived rule
from C ∨ (A ⇒ B)^T derive C ∨ A^F ∨ B^T directly.
With these derivable rules, theorem proving becomes quite efficient. To get a better under-
standing of the calculus, we look at an example: we prove an axiom of the Hilbert Calculus we
have studied above.
Result: {P^F ∨ Q^F ∨ R^T, P^F ∨ Q^T, P^T, R^F}
Example 13.5.10. Resolution Proof
1  P^F ∨ Q^F ∨ R^T    initial
2  P^F ∨ Q^T          initial
3  P^T                initial
4  R^F                initial
5  P^F ∨ Q^F          resolve 1.3 with 4.1
6  Q^F                resolve 5.1 with 3.1
7  P^F                resolve 2.2 with 6.1
8  2                  resolve 7.1 with 3.1
Let ∆ be a clause set that contains a unit clause {l}, and let ∆′ result from ∆ by removing all clauses that contain l and deleting the opposite literal of l from the remaining clauses. Then ∆ is satisfiable, iff ∆′ is. We call ∆′ the clause set simplification of ∆ wrt. l.
Corollary 13.5.11. Adding clause set simplification wrt. unit clauses to R0 does not affect soundness and completeness.
This is almost always a good idea! (clause set simplification is cheap)
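A minimal helper (mine) implementing clause set simplification as stated above, on the frozenset clause representation from the resolution sketch:

def simplify(clauses, lit):
    """Clause set simplification wrt. the literal lit = (proposition, truth value)."""
    p, a = lit
    opposite = (p, "F" if a == "T" else "T")
    return {c - {opposite}                          # delete the opposite literal ...
            for c in clauses if lit not in c}       # ... and drop clauses satisfied by lit

D = {frozenset({("P", "T"), ("Q", "F")}), frozenset({("P", "F")}), frozenset({("Q", "T")})}
print(simplify(D, ("Q", "T")))   # leaves the clauses {P^T} and {P^F}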
Before we come to the general mechanism, we will go into how we would “convince ourselves that the Wumpus is in [1, 3]”.
Idea: We formalize the knowledge about the Wumpus world in PL0 and use a test
calculus to check for entailment.
Simplification: We worry only about the Wumpus and stench: S_{i,j} =̂ stench in [i, j], W_{i,j} =̂ Wumpus in [i, j].
Propositions whose value we know: ¬S_{1,1}, ¬W_{1,1}, ¬S_{2,1}, ¬W_{2,1}, S_{1,2}, ¬W_{1,2}.
The first step is to compute the clause normal form of the relevant knowledge. Given this clause normal form, we only need to generate the empty clause via repeated applications of the resolution rule.
Now that we have seen how we can use propositional inference to derive consequences of the
percepts and world knowledge, let us come back to the question of a general mechanism for agent
functions with propositional inference.
Admittedly, the search framework from chapter 8 does not quite cover the agent function we
have here, since that assumes that the world is fully observable, which the Wumpus world is
emphatically not. But it already gives us a good impression of what would be needed for the
“general mechanism”.
Summary
Every propositional formula can be brought into conjunctive normal form (CNF),
which can be identified with a set of clauses.
The tableau and resolution calculi are deduction procedures based on trying to derive
a contradiction from the negated theorem (a closed tableau or the empty clause).
They are refutation complete, and can be used to prove KB |= A by showing that
KB ∪ {¬A} is unsatisfiable.
Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this in
AI-2, but provide one for the calculi introduced so far in??.
Chapter 14
Formal Systems
We will now take a more abstract view and introduce the necessary prerequisites of abstract rule
systems. We will also take the opportunity to discuss the quality criteria for calculi.
The notion of a logical system is at the basis of the field of logic. In its most abstract form, a logical
system consists of a formal language, a class of models, and a satisfaction relation between models
and expressions of the formal language. The satisfaction relation tells us when an expression is
deemed true in this model.
Logical Systems
Definition 14.0.1. A logical system (or simply a logic) is a triple L:=⟨L, K, |=⟩,
where L is a formal language, K is a set and |= ⊆ K × L. Members of L are called
formulae of L, members of K models for L, and |= the satisfaction relation.
Example 14.0.2 (Propositional Logic). ⟨wff(Σ_PL0, V_PL0), K, |=⟩ is a logical system, if we define K := V_PL0 ⇀ D_0 (the set of variable assignments) and φ |= A iff I_φ(A) = T.
Definition 14.0.3. Let ⟨L, K, |=⟩ be a logical system, M∈K be a model and A∈L a formula, then we say that A is satisfied by M, iff M |= A, falsified by M, iff M ̸|= A, satisfiable in K, iff M |= A for some M∈K, valid in K (write |= A), iff M |= A for all M∈K, falsifiable in K, iff M ̸|= A for some M∈K, and unsatisfiable in K, iff M ̸|= A for all M∈K.
Let us now turn to the syntactical counterpart of the entailment relation: derivability in a
calculus. Again, we take care to define the concepts at the general level of logical systems.
The intuition of a calculus is that it provides a set of syntactic rules that allow us to reason by considering the form of propositions alone. Such rules are called inference rules, and they can be strung together into derivations – which can alternatively be viewed either as sequences of formulae where all formulae are justified by prior formulae or as trees of inference rule applications. But we
can also define a calculus in the more general setting of logical systems as an arbitrary relation on
formulae with some general properties. That allows us to abstract away from the homomorphic
setup of logics and calculi and concentrate on the basics.
Definition 14.0.6. Let L be the formal language of a logical system, then an inference rule over L is a decidable (n+1)-ary relation on L. Inference rules are traditionally written as
A_1 ... A_n
――――――――――― N
     C
where A_1, ..., A_n and C are formula schemata for L and N is a name.
The A_i are called assumptions of N, and C is called its conclusion.
By formula schemata we mean representations of sets of formulae; we use boldface uppercase letters as (meta-)variables for formulae. For instance, the formula schema A ⇒ B represents the set of formulae whose head is ⇒.
Derivations
Definition 14.0.9. Let L := ⟨L, K, |=⟩ be a logical system and C a calculus for L, then a C-derivation of a formula C∈L from a set H ⊆ L of hypotheses (write H⊢C C) is a sequence A_1, ..., A_m of L-formulae, such that
A_m = C, (derivation culminates in C)
for all 1≤i≤m, either A_i∈H, (hypothesis)
or there is an inference rule
A_{l_1} ... A_{l_k}
―――――――――――――――――――
        A_i
in C with l_j < i for all j≤k. (rule application)
We can also see a derivation as a derivation tree, where the A_{l_j} are the children of the node A_i.
Example 14.0.10.
Inference rules are relations on formulae represented by formula schemata (where boldface, uppercase letters are used as meta-variables for formulae). For instance, in Example 14.0.10 the inference rule
A ⇒ B    A
――――――――――
     B
was applied in a situation where the meta-variables A and B were instantiated by the formulae P and Q ⇒ P.
As axioms do not have assumptions, they can be added to a derivation at any time. This is just
what we did with the axioms in Example 14.0.10.
Formal Systems
Let ⟨L, K, |=⟩ be a logical system and C a calculus, then ⊢C is a derivation relation
and thus ⟨L, K, |=, ⊢C ⟩ a derivation system.
Therefore we will sometimes also call ⟨L, K, |=, C ⟩ a formal system, iff L:=⟨L, K, |=⟩
is a logical system, and C a calculus for L.
Definition 14.0.11.
Let C be a calculus, then a C-derivation ∅⊢C A is called a proof of A and if one
exists (write ⊢C A) then A is called a C-theorem.
Definition 14.0.12.
An inference rule I is called admissible in a calculus C, if the extension of C by I
does not yield new theorems.
Definition 14.0.13. An inference rule
A_1 ... A_n
―――――――――――
     C
is called derivable (or a derived rule) in a calculus C, if there is a C-derivation A_1, ..., A_n ⊢C C.
Observation 14.0.14. Derivable inference rules are admissible, but not the other
way around.
The notion of a formal system encapsulates the most general way we can conceptualize a system
with a calculus, i.e. a system in which we can do “formal reasoning”.
Chapter 15
Propositional Reasoning: SAT Solvers
15.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25019.
Definition 15.1.2. Tools addressing SAT are commonly referred to as SAT solvers.
Upshot: Anything we can do with CSP, we can (in principle) do with SAT.
[Circuit figure: a register of 2 bits x1 and x0 encoding the value c = 2·x1 + x0 (FF =̂ flip-flop, D =̂ data in, CLK =̂ clock).]
To Verify: If c < 3 in the current clock cycle, then c < 3 in the next clock cycle.
The answer is “no”. And in some cases we can figure out exactly when they
are/aren’t hard to solve.
function DPLL(∆, I)  /∗ I: a partial interpretation, initially empty ∗/
∆′ := a copy of ∆; I ′ := I
/∗ Unit Propagation: ∗/
while ∆′ contains a unit clause {l} do
extend I ′ with the respective truth value for the proposition underlying l
simplify ∆′ /∗ remove false literals ∗/
/∗ Termination Test: ∗/
if 2∈∆′ then return ‘‘unsatisfiable’’
if ∆′ = {} then return I ′
/∗ Splitting Rule: ∗/
select some proposition P for which I ′ is not defined
I ′′ := I ′ extended with one truth value for P ; ∆′′ := a copy of ∆′ ; simplify ∆′′
if I ′′′ := DPLL(∆′′ ,I ′′ ) ̸= ‘‘unsatisfiable’’ then return I ′′′
I ′′ := I ′ extended with the other truth value for P ; ∆′′ := ∆′ ; simplify ∆′′
return DPLL(∆′′ ,I ′′ )
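For concreteness, here is a compact Python rendering of the procedure above (a sketch of mine; it reuses the simplify helper from the clause set simplification sketch in the previous chapter and omits everything a real SAT solver adds, such as clause learning, branching heuristics, and watched literals):

def dpll(clauses, I=None):
    """Return a satisfying (partial) interpretation as a dict, or None for 'unsatisfiable'."""
    I = dict(I or {})
    clauses = set(clauses)
    # Unit Propagation
    while True:
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        (p, a), = unit
        I[p] = (a == "T")
        clauses = simplify(clauses, (p, a))
    # Termination Test
    if frozenset() in clauses:
        return None                               # empty clause derived: unsatisfiable
    if not clauses:
        return I
    # Splitting Rule
    p = next(iter(next(iter(clauses))))[0]        # some proposition occurring in clauses
    for a in ("T", "F"):
        result = dpll(simplify(clauses, (p, a)), {**I, p: a == "T"})
        if result is not None:
            return result
    return None

# Example 15.2.2:
D = [frozenset({("P", "T"), ("Q", "T"), ("R", "F")}), frozenset({("P", "F"), ("Q", "F")}),
     frozenset({("R", "T")}), frozenset({("P", "T"), ("Q", "F")})]
print(dpll(D))   # e.g. {'R': True, 'P': True, 'Q': False}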
Example 15.2.2 (UP and Splitting). Let ∆ := (P^T ∨ Q^T ∨ R^F ; P^F ∨ Q^F ; R^T ; P^T ∨ Q^F)
1. UP Rule: R ↦ T, leaving P^T ∨ Q^T ; P^F ∨ Q^F ; P^T ∨ Q^F
2. Splitting Rule:
2a. P ↦ F, leaving Q^T ; Q^F                   2b. P ↦ T, leaving Q^F
3a. UP Rule: Q ↦ T, leaving 2;                 3b. UP Rule: Q ↦ F, clause set empty;
    returning “unsatisfiable”                      returning “R ↦ T, P ↦ T, Q ↦ F”
2. UP Rule: Q ↦ T, leaving P^F ; P^T ∨ R^F ; R^T
3. UP Rule: R ↦ T, leaving P^F ; P^T
4. UP Rule: P ↦ T, leaving 2
[Figure: a DPLL search tree branching on P, then X_1, ..., X_n, then Q on every branch.]
Properties of DPLL
Unsatisfiable case: What can we say if “unsatisfiable” is returned?
In this case, we know that ∆ is unsatisfiable: Unit propagation is sound, in the
sense that it does not reduce the set of solutions.
Satisfiable case: What can we say when a partial interpretation I is returned?
Any extension of I to a complete interpretation satisfies ∆. (By construction,
I suffices to satisfy all clauses.)
UP =̂ Unit Resolution
Observation: The unit propagation (UP) rule corresponds to a calculus:
while ∆′ contains a unit clause {l} do
extend I ′ with the respective truth value for the proposition underlying l
simplify ∆′ /∗ remove false literals ∗/
Definition 15.3.1 (Unit Resolution). Unit resolution (UR) is the test calculus consisting of the following inference rule:
UR: from C ∨ P^α and P^β with α ≠ β derive C
Unit propagation =̂ resolution restricted to cases where one parent is a unit clause.
UR makes only limited inferences, as long as there are unit clauses. It does not
guarantee to infer everything that can be inferred.
Example 15.3.7. We follow the steps in the proof of Theorem 15.3.6 for ∆ := (Q^F ∨ P^F ; P^T ∨ Q^F ∨ R^F ∨ S^F ; Q^T ∨ S^F ; R^T ∨ S^F ; S^T).
[Figure: on the left, the DPLL tree (without UP, each leaf annotated with the clause that became empty there), splitting on S, Q, R, and P; on the right, the tree resolution proof of 2 read off from that DPLL tree.]
7. Due to (1), we have (b) for N k . But we do not necessarily have (a): C(N ) ⊆ {L1 , . . . , Lk },
but there are cases where Lk ̸∈C(N ) (e.g., if X k is not contained in any clause and thus
branching over it was completely unnecessary). If so, however, we can simply remove N k and
all its descendants from the tree as well. We attach C(N) at the L_(k−1) branch of N_(k−1), in the role of C(N_(k−1), L_(k−1)). If L_(k−1)∈C(N) then we have (a) for N′ := N_(k−1) and can stop. If L_(k−1)∉C(N), then we remove N_(k−1) and so forth, until either we stop with (a), or have removed N_1 and thus must already have derived the empty clause (because C(N) ⊆ {L_1, ..., L_k}\{L_1, ..., L_k}).
8. Unit propagation can be simulated via applications of the splitting rule, choosing a proposi-
tion that is constrained by a unit clause: One of the two truth values then immediately yields
an empty clause.
In fact: DPLL =̂ tree resolution.
Definition 15.3.9. In a tree resolution, each derived clause C is used only once
(at its parent).
Problem: The same C must be derived anew every time it is used!
This is a fundamental weakness: There are inputs ∆ whose shortest tree reso-
lution proof is exponentially longer than their shortest (general) resolution proof.
Intuitively: DPLL makes the same mistakes over and over again.
Idea: DPLL should learn from its mistakes on one search branch, and apply the
learned knowledge to other branches.
To the rescue: clause learning (up next)
[Figure: the DPLL search tree on ∆ ∪ Θ: after splitting on P, X_1, ..., X_n, and Q, every one of the exponentially many branches runs into the same conflict (R^T; 2).]
Proof sketch: UP can’t derive l′ whose value was already set beforehand.
Intuition: The initial vertices are the choice literals and unit clauses of ∆.
2a. Splitting Rule: P ↦ F, i.e. choice literal P^F; the remaining clauses are Q^T ; Q^F.
3a. UP Rule: Q ↦ T, i.e. implied literal Q^T, with edges (R^T, Q^T) and (P^F, Q^T).
The clause P^T ∨ Q^F becomes empty: conflict vertex 2_{P^T ∨ Q^F}, with edges (P^F, 2_{P^T ∨ Q^F}) and (Q^T, 2_{P^T ∨ Q^F}).
P F
T
X1
T F
Xn Xn
T F T F
Q Q Q Q
T F T F T F T F
T T T T T T T T
R ; 2R ; 2R ; 2R ; 2R ; 2R ; 2R ; 2R ; 2
∆ := P^F ∨ Q^F ∨ R^T ; P^F ∨ Q^F ∨ R^F ; P^F ∨ Q^T ∨ R^T ; P^F ∨ Q^T ∨ R^F
Θ := X_1^T ∨ ... ∨ X_n^T ; X_1^F ∨ ... ∨ X_n^F
[Figure: the implication graph for DPLL on ∆ ∪ Θ with choice literals P^T, X_1^T, ..., X_n^T, Q^T, the implied literal R^T, and a conflict vertex.]
It depends on “ordering decisions” during UP: Which unit clause is picked first.
Example 15.4.8. ∆ = P^F ∨ Q^F ; Q^T ; P^T
[Figure: the two implication graphs resulting from the two possible UP orderings: Option 1 (propagate Q^T first) contains Q^T and P^F, Option 2 (propagate P^T first) contains P^T and Q^F; each ends in a conflict vertex for P^F ∨ Q^F.]
Conflict Graphs
A conflict graph captures “what went wrong” in a failed node.
Definition 15.4.9 (Conflict Graph). Let ∆ be a clause set, and let G^impl_β be the implication graph for some search branch β of DPLL on ∆. A subgraph C of G^impl_β is a conflict graph if:
(i) C contains exactly one conflict vertex 2_C.
(ii) If l′ is a vertex in C, then all parents of l′, i.e. vertices l_i with an edge (l_i, l′), are vertices in C as well.
(iii) All vertices in C have a path to 2_C.
Conflict graph =̂ starting at a conflict vertex, backchain through the implication graph until reaching choice literals.
Clause Learning
Observation: Conflict graphs encode the entailment relation.
Definition 15.5.1. Let ∆ be a clause set, C be a conflict graph at some time point during a run of DPLL on ∆, and L be the choice literals in C, then we call c := ⋁_{l∈L} l̄ (the disjunction of the negations of the choice literals) the learned clause for C.
Theorem 15.5.2. Let ∆, C, and c be as in Definition 15.5.1, then ∆ |= c.
Idea: We can add learned clauses to DPLL derivations at any time without losing
soundness. (maybe this helps, if we have a good notion of learned clauses)
[Figure: the conflict graph for the situation above: choice literals P^T and Q^T, implied literal R^T, and the conflict vertex.]
Learned clause: P^F ∨ Q^F
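The following sketch (mine) shows how a learned clause is read off a conflict graph: represent the implication graph by a map from vertices to their parents, backchain from the conflict vertex, and return the negations of the choice literals reached. The concrete graph below encodes the running example (choices P^T, X_i^T, Q^T; implied R^T; conflict on the clause of ∆ that becomes empty), and the function indeed returns P^F ∨ Q^F.

def learned_clause(parents, conflict, choices):
    """Backchain from the conflict vertex to the choice literals and negate them."""
    seen, stack, reached = set(), [conflict], set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if v in choices:
            reached.add(v)                       # choice literal: contributes to the learned clause
        else:
            stack.extend(parents.get(v, []))     # implied literal / conflict vertex: follow the edges
    return frozenset((p, "F" if a == "T" else "T") for (p, a) in reached)

# Running example: choices P^T, X1^T, ..., Xn^T, Q^T; R^T is implied by P^T and Q^T;
# the clause P^F v Q^F v R^F of Delta becomes empty, giving the conflict vertex.
parents = {("R", "T"): [("P", "T"), ("Q", "T")],
           "conflict": [("P", "T"), ("Q", "T"), ("R", "T")]}
choices = {("P", "T"), ("Q", "T")} | {(f"X{i}", "T") for i in range(1, 4)}
print(learned_clause(parents, "conflict", choices))   # frozenset({('P', 'F'), ('Q', 'F')})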
Example 15.5.5. l_1 = P, C = P^F ∨ Q^F, l′ = Q.
Observation:
Given the earlier choices l_1, ..., l_k, after we have learned the new clause C = l̄_1 ∨ ... ∨ l̄_k ∨ l̄′, the value of l′ is now set by UP!
So we can continue:
3. We set the opposite l̄′ of the last choice as an implied literal, e.g. Q^F as an implied literal.
4. We run UP and analyze conflicts.
Learned clause: earlier choices only! e.g. C = P^F, see next slide.
∆ := P^F ∨ Q^F ∨ R^T ; P^F ∨ Q^F ∨ R^F ; P^F ∨ Q^T ∨ R^T ; P^F ∨ Q^T ∨ R^F
Θ := X_1^T ∨ ... ∨ X_100^T ; X_1^F ∨ ... ∨ X_100^F
[Figure: the implication graph after learning P^F ∨ Q^F: choice literal P^T; Q^F now an implied literal (set by UP from the learned clause); further choices X_1^T, ..., X_n^T; implied literal R^T and a conflict vertex 2.]
Learned clause: P^F
[Figure: the DPLL search tree with clause learning: the branch P^T, X_1^T, ..., X_n^T, Q^T runs into the conflict (R^T; 2) and learns the clause P^F ∨ Q^F; Q^F is then set by UP, and the next conflict (R^T; 2) learns P^F.]
Note: Here, the problem could be avoided by splitting over different variables.
Problem: This is not so in general! (see next slide)
Recall: DPLL =̂ tree resolution (from slide 388)
1. in particular: each derived clause C (not in ∆) is derived anew every time it is
used.
2. Problem: there are ∆ whose shortest tree resolution proof is exponentially longer
than their shortest (general) resolution proof.
Remarks
Which clause(s) to learn?:
Imagine I gave you as homework to make a formula family {φ} where the DPLL running time is necessarily on the order of 2^n.
I promise you’re not gonna find this easy . . . (although it is of course possible:
e.g., the “Pigeon Hole Problem”).
People noticed by the early 90s that, in practice, the DPLL worst case does not
tend to happen.
Modern SAT solvers successfully tackle practical instances where n > 1.000.000.
Difficulty 1: What is the “typical case” in applications? E.g., what is the “average”
Hardware Verification instance?
Consider precisely defined random distributions instead.
Difficulty 2: Search trees get very complex, and are difficult to analyze math-
ematically, even in trivial examples. Never mind examples of practical relevance
...
The most successful works are empirical. (Interesting theory is mainly concerned
with hand-crafted formulas, like the Pigeon Hole Problem.)
15.7 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25090.
Summary
SAT solvers decide satisfiability of CNF formulas. This can be used for deduction,
and is highly successful as a general problem solving technique (e.g., in Verification).
DPLL =̂ backtracking with inference performed by unit propagation (UP), which
iteratively instantiates unit clauses and simplifies the formula.
end for
end for
return ‘‘no satisfying assignment found’’
local search is not as successful in SAT applications, and the underlying ideas are
very similar to those presented in section 8.6 (Not covered here)
Resolution special cases: There’s a universe in between unit resolution and full
resolution: trade off inference vs. search.
Proof complexity: Can one resolution special case X simulate another one Y
polynomially? Or is there an exponential separation (example families where X is
exponentially less effective than Y )?
Suggested Reading:
• Chapter 7: Logical Agents, Section 7.6.1 [RN09].
– Here, RN describe DPLL, i.e., basically what I cover under “The Davis-Putnam (Logemann-
Loveland) Procedure”.
– That’s the only thing they cover of this Chapter’s material. (And they even mark it as “can
be skimmed on first reading”.)
– This does not do the state of the art in SAT any justice.
• Chapter 7: Logical Agents, Sections 7.6.2, 7.6.3, and 7.7 [RN09].
– Sections 7.6.2 and 7.6.3 say a few words on local search for SAT, which I recommend as
additional background reading. Section 7.7 describes in quite some detail how to build an
agent using propositional logic to take decisions; nice background reading as well.
Chapter 16
First-Order Predicate Logic
Let’s Talk About Blocks, Baby . . .
Question: What do you see here?
[Figure: five blocks labeled A, D, B, E, C on a table]
You say: “All blocks are red”; “All blocks are on the table”; “A is a block”.
And now: Say it in propositional logic!
Answer: “isRedA”,“isRedB”, . . . , “onTableA”, “onTableB”, . . . , “isBlockA”, . . .
Wait a sec!: Why don’t we just say, e.g., “AllBlocksAreRed” and “isBlockA”?
Problem: Could we conclude that A is red? (No)
These statements are atomic (just strings); their inner structure (“all blocks”, “is a
block”) is not captured.
Idea: Predicate Logic (PL1 ) extends propositional logic with the ability to explicitly
speak about objects and their properties.
How?: Variables ranging over objects, predicates describing object properties, . . .
Example 16.1.1. “∀x block(x) ⇒ red(x)”; “block(A)”
Note: Even when we can describe the problem suitably, for the desired reasoning,
the propositional formulation typically is way too large to write (by hand).
PL1 solution: “∀x Wumpus(x) ⇒ (∀y adj(x, y) ⇒ stench(y))”
Example 16.1.3.
There is a surjective function from the natural numbers into the reals.
First-Order Predicate Logic has many good properties (complete calculi,
compactness, unitary, linear unification,. . . )
But too weak for formalizing: (at least directly)
We make the deliberate, but non-standard design choice here to include Skolem constants into
the signature from the start. These are used in inference systems to give names to objects and
construct witnesses. Other than the fact that they are usually introduced by need, they work
exactly like regular constants, which makes the inclusion rather painless. As we can never predict
how many Skolem constants we are going to need, we give ourselves countably infinitely many for
every arity. Our supply of individual variables is countably infinite for the same reason. The formulae of first-order logic are built up from the signature and variables as terms (to represent individuals) and propositions (to represent propositions). The latter include the propositional connectives, but also quantifiers.
if p∈Σ^p_k and A_i∈wff_ι(Σ_ι, V_ι) for i≤k, then p(A_1, ..., A_k)∈wff_o(Σ_ι, V_ι),
if A, B∈wff_o(Σ_ι, V_ι) and X∈V_ι, then T, A ∧ B, ¬A, ∀X A∈wff_o(Σ_ι, V_ι). ∀ is a binding operator called the universal quantifier.
Note that we only need e.g. conjunction, negation, and universal quantification; all other logical constants can be defined from them (as we will see when we have fixed their interpretations).
Here        Elsewhere
∀x A        ⋀x A,  (x)A
∃x A        ⋁x A
The introduction of quantifiers to first-order logic brings a new phenomenon: variables that are under the scope of a quantifier behave very differently from the ones that are not. Therefore we build up a vocabulary that distinguishes the two.
free(X) := {X}
free(f(A_1, ..., A_n)) := ⋃_{1≤i≤n} free(A_i)
free(p(A_1, ..., A_n)) := ⋃_{1≤i≤n} free(A_i)
free(¬A) := free(A)
free(A ∧ B) := free(A) ∪ free(B)
free(∀X A) := free(A)\{X}
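This definition translates directly into code. The following Python sketch (mine; terms and formulae as nested tuples – ("var", X), ("fun"/"pred", name, [args]), ("not", A), ("and", A, B), ("forall", X, A)) will be reused in later sketches:

def free(f):
    """Free variables of a term or formula, following the clauses above."""
    op = f[0]
    if op == "var":
        return {f[1]}
    if op in ("fun", "pred"):                     # f = (op, name, [A1, ..., An])
        return set().union(set(), *(free(a) for a in f[2]))
    if op == "not":
        return free(f[1])
    if op == "and":
        return free(f[1]) | free(f[2])
    if op == "forall":                            # f = ("forall", X, A)
        return free(f[2]) - {f[1]}
    raise ValueError(op)

A = ("forall", "X", ("pred", "o", [("fun", "f", [("var", "X")]), ("var", "Y")]))
print(free(A))   # {'Y'}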
We will be mainly interested in (sets of) sentences – i.e. closed propositions – as the representations of meaningful statements about individuals. Indeed, we will see below that free variables do not give us additional expressivity, since they behave like constants and could be replaced by them in all situations, except the recursive definition of quantified formulae. Indeed in all situations where
variables occur freely, they have the character of meta-variables, i.e. syntactic placeholders that
can be instantiated with terms when needed in an inference calculus.
The semantics of first-order logic is a Tarski-style set-theoretic semantics where the atomic syn-
tactic entities are interpreted by mapping them into a well-understood structure, a first-order
universe that is just an arbitrary set.
Definition 16.2.13. We inherit the universe Do = {T, F} of truth values from PL0
and assume an arbitrary universe Dι ̸= ∅ of individuals (this choice is a parameter
to the semantics)
Definition 16.2.14. An interpretation I assigns values to constants, e.g.
We do not have to make the universe of truth values part of the model, since it is always the same;
we determine the model by choosing a universe and an interpretation function.
Given a first-order model, we can define the evaluation function as a homomorphism over the
construction of formulae.
The only new (and interesting) case in this definition is the quantifier case, there we define the
value of a quantified formula by the value of its scope – but with an extension of the incoming
variable assignment. Note that by passing to the scope A of ∀x A, the occurrences of the variable
x in A that were bound in ∀x A become free and are amenable to evaluation by the variable
assignment ψ:=φ,[a/X]. Note that as an extension of φ, the assignment ψ supplies exactly the
right value for x in A. This variability of the variable assignment in the definition of the value
16.2. FIRST-ORDER LOGIC 267
function justifies the somewhat complex setup of first-order evaluation, where we have the (static)
interpretation function for the symbols from the signature and the (dynamic) variable assignment
for the variables.
Note furthermore, that the value I φ (∃x A) of ∃x A, which we have defined to be ¬(∀x ¬A) is
true, iff it is not the case that I φ (∀x ¬A) = I ψ (¬A) = F for all a∈Dι and ψ:=φ,[a/X]. This is
the case, iff I ψ (A) = T for some a∈Dι . So our definition of the existential quantifier yields the
appropriate semantics.
Signature: Let Σ^f_0 := {j, m}, Σ^f_1 := {f}, and Σ^p_2 := {o}.
Universe: D_ι := {J, M}.
Interpretation: I(j) := J, I(m) := M, I(f)(J) := M, I(f)(M) := M, and I(o) := {(M, J)}.
Then ∀X o(f(X), X) is a sentence and with ψ := φ,[a/X] for a∈D_ι we have
I_φ(∀X o(f(X), X)) = T   iff   I_ψ(o(f(X), X)) = T for all a∈D_ι
                         iff   (I_ψ(f(X)), I_ψ(X))∈I(o) for all a∈{J, M}
                         iff   (I(f)(I_ψ(X)), ψ(X))∈{(M, J)} for all a∈{J, M}
                         iff   (I(f)(ψ(X)), a) = (M, J) for all a∈{J, M}
                         iff   I(f)(a) = M and a = J for all a∈{J, M}
But a ≠ J for a = M, so I_φ(∀X o(f(X), X)) = F in the model ⟨D_ι, I⟩.
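To see the evaluation at work, here is a Python sketch (mine) of this model; holds ranges the variable assignment over the universe exactly as in the calculation above and indeed returns False for ∀X o(f(X), X):

universe = {"J", "M"}
I_fun  = {"j": lambda: "J", "m": lambda: "M", "f": lambda x: "M"}   # I(f)(J) = I(f)(M) = M
I_pred = {"o": {("M", "J")}}                                        # I(o) = {(M, J)}

def evaluate(t, phi):
    """Value of a term under the interpretation and the assignment phi."""
    if t[0] == "var":
        return phi[t[1]]
    return I_fun[t[1]](*(evaluate(a, phi) for a in t[2]))

def holds(f, phi):
    """Truth value of a formula in the model, following the evaluation clauses."""
    op = f[0]
    if op == "pred":
        return tuple(evaluate(a, phi) for a in f[2]) in I_pred[f[1]]
    if op == "not":
        return not holds(f[1], phi)
    if op == "and":
        return holds(f[1], phi) and holds(f[2], phi)
    if op == "forall":   # extend phi with every value a of the universe for the bound variable
        return all(holds(f[2], {**phi, f[1]: a}) for a in universe)
    raise ValueError(op)

A = ("forall", "X", ("pred", "o", [("fun", "f", [("var", "X")]), ("var", "X")]))
print(holds(A, {}))   # False, as computed above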
Substitutions on Terms
Intuition: If B is a term and X is a variable, then we denote the result of
systematically replacing all occurrences of X in a term A by B with [B/X](A).
Problem: What about [Z/Y ], [Y /X](X), is that Y or Z?
Folklore: [Z/Y ], [Y /X](X) = Y , but [Z/Y ]([Y /X](X)) = Z of course.
(Parallel application)
Definition 16.2.20. Let wfe(Σ, V) be an expression language, then we call σ : V→wfe(Σ, V) a substitution, iff the support supp(σ) := {X | (X, A)∈σ, X ≠ A} of σ is finite. We denote the empty substitution with ϵ.
Example 16.2.22. [a/x], [f (b)/y], [a/z] instantiates g(x, y, h(z)) to g(a, f (b), h(a)).
Definition 16.2.23. Let σ be a substitution, then we call intro(σ) := ⋃_{X∈supp(σ)} free(σ(X)) the set of variables introduced by σ.
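A sketch (mine) of substitution application on terms, with a substitution represented as a finite dictionary (variables outside the dictionary are mapped to themselves, so the support is just the set of keys with a changed value); intro reuses the free function from the sketch above:

def subst(sigma, t):
    """Apply the substitution sigma (a dict: variable -> term) to the term t."""
    if t[0] == "var":
        return sigma.get(t[1], t)                  # unmapped variables stay fixed
    return (t[0], t[1], [subst(sigma, a) for a in t[2]])

def supp(sigma):
    return {x for x, a in sigma.items() if a != ("var", x)}

def intro(sigma):
    return set().union(set(), *(free(sigma[x]) for x in supp(sigma)))

# [a/x], [f(b)/y], [a/z] applied to g(x, y, h(z))  (Example 16.2.22)
a, b = ("fun", "a", []), ("fun", "b", [])
sigma = {"x": a, "y": ("fun", "f", [b]), "z": a}
print(subst(sigma, ("fun", "g", [("var", "x"), ("var", "y"), ("fun", "h", [("var", "z")])])))
# ('fun', 'g', [a, f(b), h(a)]) as nested tuples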
The extension of a substitution is an important operation, which you will run into from time
to time. Given a substitution σ, a variable x, and an expression A, σ,[A/x] extends σ with a
new value for x. The intuition is that the values right of the comma overwrite the pairs in the
substitution on the left, which already has a value for x, even though the representation of σ may
not show it.
Substitution Extension
Definition 16.2.24 (Substitution Extension).
Let σ be a substitution, then we denote the extension of σ with [A/X] by σ,[A/X]
and define it as {(Y ,B)∈σ|Y ̸= X} ∪ {(X,A)}: σ,[A/X] coincides with σ off X,
and gives the result A there.
Note: If σ is a substitution, then σ,[A/X] is also a substitution.
We also need the dual operation: removing a variable from the support:
Note that the use of the comma notation for substitutions defined in ?? is consistent with sub-
stitution extension. We can view a substitution [a/x], [f (b)/y] as the extension of the empty
substitution (the identity function on variables) by [f (b)/y] and then by [a/x]. Note furthermore,
that substitution extension is not commutative in general.
For first-order substitutions we need to extend the substitutions defined on terms to act on propo-
sitions. This is technically more involved, since we have to take care of bound variables.
Substitutions on Propositions
Problem: We want to extend substitutions to propositions, in particular to quantified formulae: What is σ(∀X A)?
Idea: σ(∀X A) = ∀X σ(A)? But if X occurs in a value σ(Y) for a free variable Y of A – e.g. for σ = [f(X)/Y] and A = p(X, Y) – then that occurrence of X would become bound by the quantifier after instantiation, whereas it was free before. Solution: Rename away the bound variable X in ∀X p(X, Y) before applying the substitution.
Definition 16.2.27 (Capture-Avoiding Substitution Application). Let σ be a
substitution, A a formula, and A′ an alphabetical variant of A, such that intro(σ)∩
BVar(A) = ∅. Then we define σ(A):=σ(A′ ).
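A sketch (mine) of capture-avoiding application on formulae, reusing subst, free, and intro from the sketches above: before descending under a quantifier whose bound variable is introduced by σ, the bound variable is renamed to a fresh one, exactly as the definition prescribes. The fresh-name scheme X0, X1, ... is an ad-hoc choice.

import itertools

def fresh(avoid):
    """A variable name not occurring in avoid."""
    return next(f"X{i}" for i in itertools.count() if f"X{i}" not in avoid)

def subst_formula(sigma, f):
    """Capture-avoiding substitution application on formulae."""
    op = f[0]
    if op == "pred":
        return (op, f[1], [subst(sigma, a) for a in f[2]])
    if op == "not":
        return (op, subst_formula(sigma, f[1]))
    if op == "and":
        return (op, subst_formula(sigma, f[1]), subst_formula(sigma, f[2]))
    if op == "forall":
        x, body = f[1], f[2]
        sigma = {y: t for y, t in sigma.items() if y != x}   # sigma does not act on the bound x
        if x in intro(sigma):                                # danger of variable capture:
            z = fresh(intro(sigma) | free(body))             # rename x to a fresh variable z
            body = subst_formula({x: ("var", z)}, body)
            x = z
        return (op, x, subst_formula(sigma, body))
    raise ValueError(op)

# [f(X)/Y] applied to forall X. p(X, Y): the bound X is renamed away first.
print(subst_formula({"Y": ("fun", "f", [("var", "X")])},
                    ("forall", "X", ("pred", "p", [("var", "X"), ("var", "Y")]))))
# ('forall', 'X0', ('pred', 'p', [('var', 'X0'), ('fun', 'f', [('var', 'X')])]))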
We now introduce a central tool for reasoning about the semantics of substitutions: the “sub-
stitution value Lemma”, which relates the process of instantiation to (semantic) evaluation. This
result will be the motor of all soundness proofs on axioms and inference rules acting on variables
via substitutions. In fact, any logic with variables and substitutions will have (to have) some form
of a substitution value Lemma to get the meta-theory going, so it is usually the first target in any
development of such a logic.
We establish the substitution-value Lemma for first-order logic in two steps, first on terms,
where it is very simple, and then on propositions.
by inductive hypothesis
2.2. This completes the inductive case, and we have proven the assertion.
To understand the proof fully, you should think about where the WLOG – it stands for “without loss of generality” – comes from.
∀I: from A derive ∀X A, provided (∗) that A does not depend on any hypothesis in which X is free.
∀E: from ∀X A derive [B/X](A).
∃I: from [B/X](A) derive ∃X A.
∃E¹: from ∃X A and a derivation of C from the local hypothesis [[c/X](A)]¹, where c∈Σ^sk_0 is new, derive C (discharging the hypothesis).
The intuition behind the rule ∀I is that a formula A with a (free) variable X can be generalized
to ∀X A, if X stands for an arbitrary object, i.e. there are no restricting assumptions about X.
The ∀E rule is just a substitution rule that allows to instantiate arbitrary terms B for X in A.
The ∃I rule says if we have a witness B for X in A (i.e. a concrete term B that makes A true),
then we can existentially close A. The ∃E rule corresponds to the common mathematical practice,
where we give objects we know exist a new name c and continue the proof by reasoning about this
concrete object c. Anything we can prove from the assumption [c/X](A) we can prove outright if
∃X A is known.
A proof of ∃X ¬P(X) from the hypothesis ¬(∀X P(X)) in ND1: assume ¬(∃X ¬P(X)) [1] and ¬P(X) [2]. From [2] we get ∃X ¬P(X) by ∃I, which together with [1] yields F by FI; hence ¬¬P(X) by ¬I (discharging [2]) and P(X) by ¬E. Since X is not free in any remaining hypothesis, ∀I gives ∀X P(X), which together with ¬(∀X P(X)) yields F by FI. Now ¬I (discharging [1]) gives ¬¬(∃X ¬P(X)), and ¬E finally yields ∃X ¬P(X).
Now we reformulate the classical formulation of the calculus of natural deduction as a sequent calculus by lifting it to the “judgements level”, as we did for propositional logic. We only need to provide new quantifier rules.
Definition 16.3.3 (New Quantifier Rules). The inference rules of the first-order sequent calculus ND⊢1 consist of those from ND⊢0 plus the following quantifier rules:
∀I: from Γ⊢A derive Γ⊢∀X A, provided X is not free in Γ.
∀E: from Γ⊢∀X A derive Γ⊢[B/X](A).
∃I: from Γ⊢[B/X](A) derive Γ⊢∃X A.
∃E: from Γ⊢∃X A and Γ, [c/X](A)⊢C with c∈Σ^sk_0 new, derive Γ⊢C.
Definition 16.3.4 (First-Order Logic with Equality). We extend PL1 with a new
logical symbol for equality = ∈Σp2 and fix its semantics to I(=):={(x,x)|x∈Dι }.
We call the extended logic first-order logic with equality (PL1= )
We now extend natural deduction as well.
Definition 16.3.5. For the calculus of natural deduction with equality (ND^=_1) we add the following two rules to ND1 to deal with equality:
=I: derive A = A (from no assumptions).
=E: from A = B and C[A]_p derive [B/p]C.
where we write C[A]_p if the formula C has a subterm A at position p, and [B/p]C is the result of replacing that subterm with B.
In many ways equivalence behaves like equality, so we will use the following rules in ND1.
Definition 16.3.6. ⇔I is derivable and ⇔E is admissible in ND1:
⇔I: derive A ⇔ A (from no assumptions).
⇔E: from A ⇔ B and C[A]_p derive [B/p]C.
Again, we have two rules that follow the introduction/elimination pattern of natural deduction
calculi. To make sure that we understand the constructions here, let us get back to the
“replacement at position” operation used in the equality rules.
Positions in Formulae
Idea: Formulae are (naturally) trees, so we can use tree positions to talk about
subformulae
Definition 16.3.7. A position p is a tuple of natural numbers that in each node of an expression (tree) specifies into which child to descend. For an expression A we denote the subexpression at p with A|_p.
We will sometimes write an expression C as C[A]_p to indicate that C has the subexpression A at position p.
Definition 16.3.8. Let p be a position, then [A/p]C is the expression obtained from C by replacing the subexpression at p by A.
Example 16.3.9 (Schematically).
[Figure: the expression C as a tree with the subexpression A = C|_p at position p, next to the tree of [B/p]C with B at position p.]
The operation of replacing a subformula at position p is quite different from e.g. (first-order)
substitutions:
• We are replacing subformulae with subformulae instead of instantiating variables with terms.
• substitutions replace all occurrences of a variable in a formula, whereas formula replacement
only affects the (one) subformula at position p.
We conclude this section with an extended example: the proof of a classical mathematical result
in the natural deduction calculus with equality. This shows us that we can derive strong properties
about complex situations (here the real numbers; an uncountably infinite set of numbers).
ND^=_1 Example: √2 is Irrational
If we want to formalize this into ND1, we have to write down all the assertions in the proof steps in PL1 syntax and come up with justifications for them in terms of ND1 inference rules. The next two slides show such a proof, where we write prime(n) to denote that n is prime, use #(n) for the number of prime factors of a number n, and write irr(r) if r is irrational.
ND^=_1 Example: √2 is Irrational (the Proof)
Lines 6 and 9 are local hypotheses for the proof (they only have an implicit counterpart in the
inference rules as defined above). Finally we have abbreviated the arithmetic simplification of line
9 with the justification “arith” to avoid having to formalize elementary arithmetic.
ND^=_1 Example: √2 is Irrational (the Proof continued)
13        prime(2)                lemma
14  6,9   #(2q²) = #(q²) + 1      ⇒E(13, 12)
15  6,9   #(q²) = 2#(q)           ∀E²(2)
16  6,9   #(2q²) = 2#(q) + 1      =E(14, 15)
17        #(p²) = #(p²)           =I
18  6,9   #(2q²) = #(p²)          =E(17, 10)
19  6,9   2#(q) + 1 = #(p²)       =E(18, 16)
20  6,9   2#(q) + 1 = 2#(p)       =E(19, 11)
21  6,9   ¬(2#(q) + 1 = 2#(p))    ∀E²(1)
22  6,9   F                       FI(20, 21)
23  6     F                       ∃E⁹(22)
24        ¬¬irr(√2)               ¬I⁶(23)
25        irr(√2)                 ¬E(24)
We observe that the ND1 proof is much more detailed, and needs quite a few Lemmata about
# to go through. Furthermore, we have added a definition of irrationality (and treat definitional
equality via the equality rules). Apart from these artefacts of formalization, the two representations
of proofs correspond to each other very directly.
16.4 Conclusion
Summary (Predicate Logic)
Predicate logic allows to explicitly speak about objects and their properties. It is
thus a more natural and compact representation language than propositional logic;
it also enables us to speak about infinite sets of objects.
Logic has thousands of years of history. A major current application in AI is Se-
mantic Technology.
First-order predicate logic (PL1) allows universal and existential quantification over
objects.
A PL1 interpretation consists of a universe U and a function I mapping constant
symbols/predicate symbols/function symbols to elements/relations/functions on U .
Suggested Reading:
– A less formal account of what I cover in “Syntax” and “Semantics”. Contains different exam-
ples, and complementary explanations. Nice as additional background reading.
• Sections 8.3 and 8.4 provide additional material on using PL1, and on modeling in PL1, that I
don’t cover in this lecture. Nice reading, not required for exam.
• Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this
in AI-2, but provide one for the calculi introduced so far in??.
Chapter 17
Automated Theorem Proving in First-Order Logic
In this chapter, we take up the machine-oriented calculi for propositional logic from chapter 13 and extend them to the first-order case. While this has been relatively easy for the natural deduction calculus – we only had to introduce the notion of substitutions for the elimination rule for the universal quantifier – we have to work much more here to make the calculi effective for implementation.
Algorithm: Fully expand all possible tableaux, until no rule can be applied anymore.
Satisfiable, iff there are open branches. (open branches correspond to models)
Tableau calculi develop a formula in a tree-shaped arrangement that represents a case analysis on
when a formula can be made true (or false). Therefore the formulae are decorated with exponents
that hold the intended truth value.
On the left we have a refutation tableau that analyzes a negated formula (it is decorated with the
intended truth value F). Both branches contain an elementary contradiction ⊥.
On the right we have a model generation tableau, which analyzes a positive formula (it is decorated with the intended truth value T). This tableau uses the same rules as the refutation tableau, but makes a case analysis of when this formula can be satisfied. In this case we have a closed branch and an open one; the open branch corresponds to a model.
Now that we have seen the examples, we can write down the tableau rules formally.
The propositional tableau rules:
T0∧: from (A ∧ B)^T derive A^T and B^T (on the same branch)
T0∨: from (A ∧ B)^F split the branch into A^F and B^F
T0¬T: from (¬A)^T derive A^F        T0¬F: from (¬A)^F derive A^T
T0⊥: from A^α and A^β with α ≠ β derive ⊥
These inference rules, which act on tableaux, have to be read as follows: if the formulae above the line appear on a tableau branch, then the branch can be extended by the formulae or branches below the line. There are two rules for each primary connective, and a branch closing rule that adds the special symbol ⊥ (for unsatisfiability) to a branch.
We use the tableau rules with the convention that they are only applied, if they contribute new
material to the branch. This ensures termination of the tableau procedure for propositional logic
(every rule eliminates one primary connective).
Definition 17.1.5. We will call a closed tableau with the labeled formula Aα at the root a
tableau refutation for Aα .
The saturated tableau represents a full case analysis of what is necessary to give A the truth value
α; since all branches are closed (contain contradictions) this is impossible.
Definition 17.1.7. We will call a tableau refutation for AF a tableau proof for A, since it refutes
the possibility of finding a model where A evaluates to F. Thus A must evaluate to T in all
models, which is just our definition of validity.
Thus the tableau procedure can be used as a calculus for propositional logic. In contrast to the
propositional Hilbert calculus it does not prove a theorem A by deriving it from a set of axioms,
but it proves it by refuting its negation. Such calculi are called negative or test calculi. Generally
negative calculi have computational advantages over positive ones, since they have a built-in sense
of direction.
We have rules for all the necessary connectives (we restrict ourselves to ∧ and ¬, since the others
can be expressed in terms of these two via the propositional identities above. For instance, we can
write A ∨ B as ¬(¬A ∧ ¬B), and A ⇒ B as ¬A ∨ B,. . . .)
We will now extend the propositional tableau techniques to first-order logic. We only have to add
two new rules for the universal quantifiers (in positive and negative polarity).
The rule T1∀ operationalizes the intuition that a universally quantified formula is true, iff all of the instances of the scope are. To understand the T1∃ rule, we have to keep in mind that ∃X A abbreviates ¬(∀X ¬A), so that we have to read (∀X A)^F existentially – i.e. as (∃X ¬A)^T, stating that there is an object with property ¬A. In this situation, we can simply give this object a name: c, which we take from our (infinite) set of witness constants Σ^sk_0, which we have given ourselves expressly for this purpose when we defined first-order syntax. In other words ([c/X](¬A))^T = ([c/X](A))^F holds, and this is just the conclusion of the T1∃ rule.
Note that the T1∀ rule is computationally extremely inefficient: we have to guess an instance C∈wff_ι(Σ_ι, V_ι) for X (i.e. in a search setting, systematically consider all of them). This makes the rule infinitely branching.
In the next calculus we will try to remedy the computational inefficiency of the T1 ∀ rule. We do
this by delaying the choice in the universal rule.
Definition 17.1.9. The free variable tableau calculus (T1^f) extends T0 (the propositional tableau calculus) with the quantifier rules
T1^f∀: from (∀X A)^T derive ([Y/X](A))^T, where Y is a new metavariable,
T1^f∃: from (∀X A)^F derive ([f(Y_1, ..., Y_k)/X](A))^F, where f is a new Skolem constant of arity k and Y_1, ..., Y_k are the metavariables occurring in A,
and the generalized closing rule
T1^f⊥: from A^α and B^β with α ≠ β and σ(A) = σ(B) derive ⊥:σ, instantiating the whole tableau with σ.
Metavariables: Instead of guessing a concrete instance for the universally quantified variable
as in the T1 ∀ rule, T1f ∀ instantiates it with a new meta-variable Y , which will be instantiated by
need in the course of the derivation.
Skolem terms as witnesses: The introduction of metavariables makes it necessary to extend the treatment of witnesses in the existential rule. Intuitively, we cannot simply invent a new name, since the meaning of the body A may contain metavariables introduced by the T1^f∀ rule. As we do not know their values yet, the witness for the existential statement in the antecedent of the T1^f∃ rule needs to depend on them. So we represent the witness by a witness term, concretely one obtained by applying a Skolem function to the metavariables in A.
Instantiating Metavariables: Finally, the T1f⊥ rule completes the treatment of meta-variables,
it allows to instantiate the whole tableau in a way that the current branch closes. This leaves us
with the problem of finding substitutions that make two terms equal.
Tableau Reasons about Blocks
Example 17.1.11 (Reasoning about Blocks). Returning to slide 418:
You say: “All blocks are red”; “All blocks are on the table”; “A is a block”.
Can we prove red(A) from ∀x block(x) ⇒ red(x) and block(A)?
(∀X block(X) ⇒ red(X))^T
block(A)^T
red(A)^F
(block(Y) ⇒ red(Y))^T
block(Y)^F   |   red(A)^T
⊥:[A/Y]      |   ⊥
The left branch is closed by T1^f⊥ with the unifier [A/Y] (against block(A)^T); this instantiation turns the right branch's red(Y)^T into red(A)^T, which closes against red(A)^F.
Unification (Definitions)
Definition 17.1.12. For given terms A and B, unification is the problem of finding
a substitution σ, such that σ(A) = σ(B).
Notation: We write term pairs as A=?B e.g. f (X)=?f (g(Y )).
The idea behind a most general unifier is that all other unifiers can be obtained from it by (further)
instantiation. In an automated theorem proving setting, this means that using most general
unifiers is the least committed choice — any other choice of unifiers (that would be necessary for
completeness) can later be obtained by other substitutions.
Note that there is a subtlety in the definition of the ordering on substitutions: we only compare
on a subset of the variables. The reason for this is that we have defined substitutions to be total
on (the infinite set of) variables for flexibility, but in the applications (see the definition of most
general unifiers), we are only interested in a subset of variables: the ones that occur in the initial
problem formulation. Intuitively, we do not care what the unifiers do off that set. If we did
not have the restriction to the set W of variables, the ordering relation on substitutions would
become much too fine-grained to be useful (i.e. to guarantee unique most general unifiers in our
case).
Now that we have defined the problem, we can turn to the unification algorithm itself. We
will define it in a way that is very similar to logic programming: we first define a calculus that
generates “solved forms” (formulae from which we can read off the solution) and reason about
control later. In this case we will reason that control does not matter.
Unification Problems (=̂ Equational Systems)
Idea: Unification is equation solving.
Definition 17.1.16. We call a formula A_1 =? B_1 ∧ ... ∧ A_n =? B_n a unification problem iff A_i, B_i ∈ wff_ι(Σ_ι, V_ι).
Note: We consider unification problems as sets of equations (∧ is ACI), and
equations as two-element multisets (=? is C).
In principle, unification problems are sets of equations, which we write as conjunctions, since
all of them have to be solved for finding a unifier. Note that it is not a problem for the “logical
view” that the representation as conjunctions induces an order, since we know that conjunction
is associative, commutative and idempotent, i.e. that conjuncts do not have an intrinsic order or
multiplicity, if we consider two equational problems as equal, if they are equivalent as propositional
formulae. In the same way, we will abstract from the order in equations, since we know that the
equality relation is symmetric. Of course we would have to deal with this somehow in the implementation (typically, we would implement equational problems as lists of pairs), but that belongs
into the “control” aspect of the algorithm, which we are abstracting from at the moment.
It is essential to our “logical” analysis of the unification algorithm that we arrive at unification prob-
lems whose unifiers we can read off easily. Solved forms serve that need perfectly as Lemma 17.1.21
shows.
Given the idea that unification problems can be expressed as formulae, we can express the algo-
rithm in three simple rules that transform unification problems into solved forms (or unsolvable
ones).
Unification Algorithm
Definition 17.1.22. The inference system U consists of the following three transformation rules on unification problems:
Udec:  E ∧ f(A_1, ..., A_n) =? f(B_1, ..., B_n)  ⟶  E ∧ A_1 =? B_1 ∧ ... ∧ A_n =? B_n
Utriv: E ∧ A =? A  ⟶  E
Uelim: E ∧ X =? A  ⟶  [A/X](E) ∧ X =? A, provided X∉free(A) and X∈free(E)
The decomposition rule Udec is completely straightforward, but note that it transforms one unifi-
cation pair into multiple argument pairs; this is the reason, why we have to directly use unification
problems with multiple pairs in U.
Note furthermore, that we could have restricted the Utriv rule to variable-variable pairs, since
for any other pair, we can decompose until only variables are left. Here we observe, that constant-
constant pairs can be decomposed with the Udec rule in the somewhat degenerate case without
arguments.
Finally, we observe that the first of the two variable conditions in Uelim (the “occurs-in-check”) makes sure that we only apply the transformation to unifiable unification problems, whereas the second one is a termination condition that prevents the rule from being applied twice.
The notion of completeness and correctness is a bit different than that for calculi that we
compare to the entailment relation. We can think of the “logical system of unifiability” with
the model class of sets of substitutions, where a set satisfies an equational problem E, iff all of
its members are unifiers. This view induces the soundness and completeness notions presented
above.
The three meta-properties above are relatively trivial, but somewhat tedious to prove, so we leave
the proofs as an exercise to the reader.
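The transformation rules can be packaged into a short recursive unification function. The following Python sketch (mine, on the term representation from the earlier sketches, reusing subst and free) is the standard recursive formulation rather than literally the rule system U, but it computes the same most general unifiers; the occurs check corresponds to the first variable condition of Uelim. It returns a unifier as a dictionary, or None if there is none.

def unify(A, B, sigma=None):
    """Most general unifier of A and B extending sigma, or None if the terms are not unifiable."""
    sigma = dict(sigma or {})
    A, B = subst(sigma, A), subst(sigma, B)
    if A == B:                                            # Utriv: nothing to do
        return sigma
    if A[0] != "var" and B[0] == "var":
        A, B = B, A
    if A[0] == "var":
        x = A[1]
        if x in free(B):                                  # occurs check
            return None
        return {**{y: subst({x: B}, t) for y, t in sigma.items()}, x: B}   # Uelim
    if A[0] == B[0] == "fun" and A[1] == B[1] and len(A[2]) == len(B[2]):
        for a, b in zip(A[2], B[2]):                      # Udec: descend into argument pairs
            sigma = unify(a, b, sigma)
            if sigma is None:
                return None
        return sigma
    return None                                           # head symbol clash: unsolvable

# f(X) =? f(g(Y)):
X, Y = ("var", "X"), ("var", "Y")
print(unify(("fun", "f", [X]), ("fun", "f", [("fun", "g", [Y])])))   # {'X': g(Y)}
print(unify(X, ("fun", "f", [X])))                                   # None: occurs check fails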
We now fortify our intuition about the unification calculus by two examples. Note that we only
need to pursue one possible U derivation since we have confluence.
Unification Examples
Example 17.1.27. Two similar unification problems:
We will now convince ourselves that there cannot be any infinite sequences of transformations in
U. Termination is an important property for an algorithm.
The proof we present here is very typical for termination proofs. We map unification problems
into a partially ordered set ⟨S, ≺⟩ where we know that there cannot be any infinitely descending
sequences (we think of this as measuring the unification problems). Then we show that all trans-
formations in U strictly decrease the measure of the unification problems and argue that if there
were an infinite transformation in U, then there would be an infinite descending chain in S, which
contradicts our choice of ⟨S, ≺⟩.
The crucial step in coming up with such proofs is finding the right partially ordered set.
Fortunately, there are some tools we can make use of. We know that ⟨N, <⟩ is terminating, and
there are some ways of lifting component orderings to complex structures. For instance it is well-
known that the lexicographic ordering lifts a terminating ordering to a terminating ordering on
finite dimensional Cartesian spaces. We show a similar, but less known construction with multisets
for our proof.
Unification (Termination)
Definition 17.1.28. Let S and T be multisets and ≤ a partial ordering on S ∪ T .
Then we define S ≺m T, iff S = C ⊎ S′ and T = C ⊎ {t}, where s ≤ t for all s∈S′.
We call ≺m the multiset ordering induced by ≤.
Definition 17.1.29. We call a variable X solved in a unification problem E, iff E
contains a solved pair X =? A.
Lemma 17.1.30. If ≺ is linear/terminating on S, then ≺m is linear/terminating
on P(S).
Lemma 17.1.31. U is terminating. (any U-derivation is finite)
But it is very simple to create terminating calculi, e.g. by having no inference rules. So there
is one more step to go to turn the termination result into a decidability result: we must make sure
that we have enough inference rules so that any unification problem is transformed into solved
form if it is unifiable.
Proof:
1. U-irreducible unification problems can be reached in finite time by Lemma 17.1.31.
2. They are either solved or unsolvable by Lemma 17.1.33, so they provide the
answer.
Complexity of Unification
Observation: Naive implementations of unification are exponential in time and
space.
Indeed, the only way to escape this combinatorial explosion is to find representations of substitu-
tions that are more space efficient.
[figure: the terms s3 and t3 and the instance σ3(t3) drawn as trees over the function symbol f and the variables x0, . . . , x3, illustrating the exponential size of σn(tn)]
If we look at the unification algorithm from Definition 17.1.22 and the considerations in the
termination proof (Lemma 17.1.31) with a particular focus on the role of copying, we easily find the
culprit for the exponential blowup: Uelim, which applies solved pairs as substitutions.
We will now turn the ideas we have developed in the last couple of slides into a usable func-
tional algorithm. The starting point is treating terms as DAGs. Then we try to conduct the
transformation into solved form without adding new nodes.
Unification by DAG-chase
Idea: Extend the input DAGs by edges that represent unifiers.
We write n.a, if a is the symbol of node n.
Algorithm dag−unify
Observation 17.1.40. dag−unify uses linear space, since no new nodes are created,
and at most one link per variable.
Problem: dag−unify still uses exponential time.
Example 17.1.41.
Consider the terms f(s_n, f(t′_n, x_n)) and f(t_n, f(s′_n, y_n)), where s′_n = [y_i/x_i](s_n) and
t′_n = [y_i/x_i](t_n).
dag−unify needs exponentially many recursive calls to unify the nodes x_n and y_n.
(they are unified after n calls, but checking needs the time)
Algorithm uf−unify
This only needs linearly many recursive calls as it directly returns with true or makes
a node inaccessible for find.
Linearly many calls to linear procedures give quadratic running time.
Remark: There are versions of uf−unify that are linear in time and space, but for
most purposes, our algorithm suffices.
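The slides with the dag−unify and uf−unify algorithms themselves are not reproduced above, so the following Python sketch only illustrates the union-find idea behind uf−unify: once two nodes have been unified, find returns a single representative, so repeated comparisons of shared subterms return immediately instead of re-traversing them. The node representation and names are assumptions of this sketch, and the occurs check is omitted (it would be a separate acyclicity pass over the resulting DAG).

# Sketch of unification on term DAGs with a union-find structure.
class Node:
    def __init__(self, symbol=None, args=None):
        self.symbol = symbol        # None for variable nodes
        self.args = args or []      # successor nodes in the term DAG
        self.parent = self          # union-find pointer

def find(n):
    while n.parent is not n:
        n.parent = n.parent.parent  # path compression
        n = n.parent
    return n

def uf_unify(s, t):
    """Destructively unify the DAG nodes s and t; return True or False."""
    s, t = find(s), find(t)
    if s is t:                      # already unified: return immediately
        return True
    if s.symbol is None:            # s is a variable: merge it into t
        s.parent = t
        return True
    if t.symbol is None:            # t is a variable: merge it into s
        t.parent = s
        return True
    if s.symbol != t.symbol or len(s.args) != len(t.args):
        return False                # head clash
    t.parent = s                    # merge before recursing: the key to efficiency
    return all(uf_unify(a, b) for a, b in zip(s.args, t.args))

# e.g. unifying x with f(y) and then y with a:
# x, y, a = Node(), Node(), Node("a")
# uf_unify(x, Node("f", [y])); uf_unify(y, a)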
After we have used up p(y) by applying [a/y] in T1f⊥, we have to get a new instance p(z) of the
universally quantified formula; the remaining branch then closes with ⊥ : [b/z].
Proof sketch: All T1f rules reduce the number of connectives and negative ∀ or the
multiplicity of positive ∀.
Theorem 17.1.46. T1f is only complete with unbounded multiplicities.
Proof sketch: Replace p(a) ∨ p(b) with p(a1 ) ∨ . . . ∨ p(an ) in Example 17.1.43.
Remark: Otherwise validity in PL1 would be decidable.
Implementation: We need an iterative multiplicity deepening process.
The other thing we need to realize is that there may be multiple ways we can use T1f⊥ to close a
branch in a tableau, and – as T1f⊥ instantiates the whole tableau and not just the branch itself –
this choice matters.
Treating T1f⊥
Example 17.1.47. Choosing which pair to cut on matters – this tableau does not close!

(∃x ((p(a) ∧ p(b) ⇒ p(x)) ∧ (q(b) ⇒ q(x))))^F
((p(a) ∧ p(b) ⇒ p(y)) ∧ (q(b) ⇒ q(y)))^F
left branch: (p(a) ∧ p(b) ⇒ p(y))^F, p(a)^T, p(b)^T, p(y)^F, closed with ⊥ : [a/y]
right branch: (q(b) ⇒ q(y))^F, q(b)^T, q(y)^F

Closing the left branch with [a/y] instantiates y with a in the whole tableau, so the right branch
(q(b)^T, q(a)^F) can no longer be closed; we should have used [b/y] instead.
The method of spanning matings follows the intuition that if we do not have good information
on how to decide for a pair of opposite literals on a branch to use in T1f⊥, we delay the choice by
initially disregarding the rule altogether during saturation and then – in a later phase– looking
for a configuration of cuts that have a joint overall unifier. The big advantage of this is that we
only need to know that one exists, we do not need to compute or apply it, which would lead to
exponential blow-up as we have seen above.
Observation 17.1.48. T1f without T1f⊥ is terminating and confluent for given
multiplicities.
Idea: Saturate without T1f⊥ and treat all cuts at the same time (later).
Definition 17.1.49.
Let T be a T1f tableau, then we call a unification problem E := A1 =? B1 ∧ . . . ∧ An =? Bn
a mating for T , iff Ai^T and Bi^F occur in the same branch in T .
We say that E is a spanning mating, if E is unifiable and every branch B of T
contains Ai^T and Bi^F for some i.
Theorem 17.1.50. A T1f -tableau with a spanning mating induces a closed T1
tableau.
Proof sketch: Just apply the unifier of the spanning mating.
Excursion: Now that we understand basic unification theory, we can come to the meta-theoretical
properties of the tableau calculus. We delegate this discussion to??.
Excursion: Again, we relegate the meta-theoretical properties of the first-order resolution calculus
to??.
Remark: Modern resolution theorem provers prove this in less than 50ms.
Problem: That is only true if we give the theorem prover exactly the right
laws and background knowledge. If we give it all of them, it drowns in the combinatorial
explosion.
West is an American:
Clause: ami(West)
The country Nono is an enemy of America:
enmy(NN, USA)
Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this in
the course, but provide one for the calculi introduced so far in??.
Definition 17.3.4. A Horn clause is a clause with at most one positive literal.
Recall: Backchaining as search:
state = tuple of goals; goal state = empty list (of goals).
next(⟨G, R1 , . . ., Rl ⟩):=⟨σ(B 1 ), . . ., σ(B m ), σ(R1 ), . . ., σ(Rl )⟩ if there is a rule
H:−B 1 ,. . ., B m . and a substitution σ with σ(H) = σ(G).
Note: Backchaining becomes resolution:
from P^T ∨ A and P^F ∨ B infer A ∨ B
i.e. positive, unit-resulting hyperresolution (PURR).
This observation helps us understand ProLog better, and use implementation techniques from
theorem proving.
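To make the next() definition above concrete, here is a Python sketch of backchaining over Horn clauses, reusing the unify/apply_subst sketch from the unification section; rules H:−B1,...,Bm become pairs (head, [body literals]). All names are our own, not ProLog's actual machinery.

# Backchaining as search in the space of goal tuples (cf. next() above).
import itertools
_fresh = itertools.count()

def rename(t, mapping):
    # rename rule variables apart before each use
    if isinstance(t, str):
        return mapping.setdefault(t, t + "_" + str(next(_fresh)))
    return (t[0],) + tuple(rename(s, mapping) for s in t[1:])

def backchain(goals, rules, answer=None):
    """Yield answer substitutions; may not terminate for recursive programs."""
    answer = answer or {}
    if not goals:                                   # goal state: no goals left
        yield answer
        return
    goal, rest = goals[0], goals[1:]
    for head, body in rules:
        mapping = {}
        head = rename(head, mapping)
        body = [rename(b, mapping) for b in body]
        sigma = unify([(apply_subst(answer, goal), head)])
        if sigma is None:
            continue
        theta = {x: apply_subst(sigma, t) for x, t in answer.items()}
        theta.update(sigma)
        new_goals = [apply_subst(theta, g) for g in body + list(rest)]
        yield from backchain(new_goals, rules, theta)

# e.g. rules = [(("nat", ("zero",)), []),
#               (("nat", ("s", "X")), [("nat", "X")])]
# next(backchain([("nat", ("s", ("zero",)))], rules)) succeeds.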
To gain an intuition for this quite abstract definition let us consider a concrete knowledge base
about cars. Instead of writing down everything we know about cars, we only write down that cars
are motor vehicles with four wheels and that a particular object c has a motor and four wheels. We
can see that the fact that c is a car can be derived from this. Given our definition of a knowledge
base as the deductive closure of the facts and rule explicitly written down, the assertion that c is
a car is in the induced knowledge base, which is what we are after.
In this very simple example car(c) is about the only fact we can derive, but in general, knowledge
bases can be infinite (we will see examples below).
e.g. greek(sokrates),greek(perikles)
Question: Are there fallible greeks?
Indefinite answer: Yes, Perikles or Sokrates
Warning: how about Sokrates and Perikles?
e.g. greek(sokrates),roman(sokrates):−.
Query: Are there fallible greeks?
Answer: Yes, Sokrates, if he is not a roman
Is this abduction?????
299
300 CHAPTER 18. KNOWLEDGE REPRESENTATION AND THE SEMANTIC WEB
According to an influential view of [PRR97], knowledge appears in layers. Starting with a character
set that defines a set of glyphs, we can add syntax that turns mere strings into data. Adding context
information gives information, and finally, relating the information to other information allows
us to draw conclusions, turning information into knowledge.
Note that we already have aspects of representation and function in the diagram at the top of the
slide. In this, the additional functionality added in the successive layers gives the representations
more and more functions, until we reach the knowledge level, where the function is given by inferencing.
In the second example, we can see that representations determine possible functions.
Let us now strengthen our intuition about knowledge by contrasting knowledge representations
with “regular” data structures in computation.
Answer:
No good reason other than AI practice, with the intuition that
data is simple and general (supports many algorithms)
knowledge is complex (has distinguished process model)
As knowledge is such a central notion in artificial intelligence, it is not surprising that there are
multiple approaches to dealing with it. We will only deal with the first one and leave the others
to self-study.
When assessing the relative strengths of the respective approaches, we should evaluate them with
respect to a pre-determined set of criteria.
KR Approaches/Evaluation Criteria
Definition 18.1.1. The evaluation criteria for knowledge representation approaches
are:
Expressive adequacy: What can be represented, what distinctions are supported.
Reasoning efficiency: Can the representation support processing that generates
results in acceptable speed?
Primitives: What are the primitive elements of representation, are they intuitive,
cognitively adequate?
Meta representation: Knowledge about knowledge
Completeness: The problems of reasoning with knowledge that is known to be
incomplete.
Even though the network in Example 18.1.3 is very intuitive (we immediately understand the
concepts depicted), it is unclear how we (and more importantly a machine that does not asso-
ciate meaning with the labels of the nodes and edges) can draw inferences from the “knowledge”
represented.
Idea: Links labeled with “isa” and “inst” are special: they propagate properties
encoded by other links.
Definition 18.1.6. We call links labeled by
“isa” an inclusion or isa link (inclusion of concepts)
“inst” instance or inst link (concept membership)
We now make the idea of “propagating properties” rigorous by defining the notion of derived
relations, i.e. the relations that are left implicit in the network, but can be added without changing
its meaning.
[semantic network diagram: nodes bird, robin, Jack, wings, Person, Mary, and John, connected by isa, inst, has_part, owner_of, and loves links]
Slogan: Get out more knowledge from a semantic network than you put in.
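As an illustration of the slogan, here is a small Python sketch that computes derived relations for a network given as labeled triples. The propagation rules are the intuitive reading sketched above (isa is transitive, and properties attached to a concept are inherited by sub-concepts and instances); note that, unlike Definition 18.1.7 as discussed right below, this sketch also propagates concept membership along isa links, so it does derive that Jack is a bird. All names are our own.

# Deriving implicit links in a semantic network of (source, label, target) triples.
def derived(triples):
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (a, r1, b) in facts:
            for (c, r2, d) in facts:
                if b != c:
                    continue
                if r1 == "isa" and r2 == "isa":
                    new.add((a, "isa", d))        # isa is transitive
                elif r1 == "inst" and r2 == "isa":
                    new.add((a, "inst", d))       # membership propagates upwards
                elif r1 in ("isa", "inst") and r2 not in ("isa", "inst"):
                    new.add((a, r2, d))           # inherit other links downwards
        if not new <= facts:
            facts |= new
            changed = True
    return facts

network = {("robin", "isa", "bird"), ("Jack", "inst", "robin"),
           ("bird", "has_part", "wings")}
assert ("Jack", "has_part", "wings") in derived(network)   # knowledge we got out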
Note that Definition 18.1.7 does not quite allow us to derive that Jack is a bird (did you spot that
“isa” is not a relation that can be inferred?), even though we know it is true in the world. This
shows us that inference in semantic networks has to be very carefully defined and may not be
“complete”, i.e. there are things that are true in the real world that our inference procedure does
not capture.
Dually, if we are not careful, then the inference procedure might derive properties that are not
true in the real world even if all the properties explicitly put into the network are. We call such
an inference procedure unsound or incorrect.
These are two general phenomena we have to keep an eye on.
Another problem is that semantic nets (e.g. in Example 18.1.3) confuse two kinds of concepts:
individuals (represented by proper names like John and Jack) and concepts (nouns like robin and
bird). Even though the isa and inst links already acknowledge this distinction, the “has_part” and
“loves” relations are at different levels entirely, but not distinguished in the networks.
[semantic network with a TBox and an ABox: the TBox contains the concepts animal, amoeba, higher animal, tiger, and elephant with links such as “animal can move”, “higher animal has_part legs/head”, “tiger pattern striped”, and “elephant color gray”; the ABox contains the objects Roy, Rex, and Clyde as instances with an “eat” relation between them]
In particular we have objects “Rex”, “Roy”, and “Clyde”, which have (derived) rela-
tions (e.g. Clyde is gray).
But there are severe shortcomings of semantic networks: the suggestive shape and node names
give (humans) a false sense of meaning, and the inference rules are only given in the process model
(the implementation of the semantic network processing system).
This makes it very difficult to assess the strength of the inference system and make assertions
e.g. about completeness.
Example 18.1.12. Consider a robin that has lost its wings in an accident:
[two networks: on the left, jack inst robin, robin isa bird, bird has_part wings; on the right, joe inst robin, where a “cancel” link blocks the inherited has_part wings]
“Cancel-links” have been proposed, but their status and process model are debatable.
To alleviate the perceived drawbacks of semantic networks, we can contemplate another notation
that is more linear and thus more easily implemented: function/argument notation.
Evaluation:
+ linear notation (equivalent, but better to implement on a computer)
+ easy to give process model by deduction (e.g. in ProLog)
– worse locality properties (networks are associative)
Indeed, function/argument notation is the most immediate way one would naturally represent
semantic networks for implementation.
This notation has been also characterized as subject/predicate/object triples, alluding to simple
(English) sentences. This will play a role in the “semantic web” later.
Building on the function/argument notation from above, we can now give a formal semantics for
semantic networks: we translate them into first-order logic and use the semantics of that.
Indeed, the semantics induced by the translation to first-order logic gives the intuitive meaning
to the semantic networks. Note that this holds only for the features of semantic networks that
are representable in this way, e.g. the “cancel links” shown above are not (and that is a feature,
not a bug).
But even more importantly, the translation to first-order logic gives a first process model: we
can use first-order inference to compute the set of inferences that can be drawn from a semantic
network.
Humans understand the text and combine the information to get the answer. Ma-
chines need more than just text ; semantic web technology.
The term “semantic web” was coined by Tim Berners-Lee in analogy to semantic networks, only
applied to the world wide web. And as for semantic networks, we want inference processes
that allow us to recover information that is not explicitly represented from the network (here the
world wide web).
To see what problems have to be solved to arrive at the semantic web, we will now look at a
concrete example about the “semantics” in web pages. Here is one that looks typical enough.
WWW2002
The eleventh International World Wide Web Conference
Sheraton Waikiki Hotel
Honolulu, Hawaii, USA
On the 7th May Honolulu will provide the backdrop of the eleventh
International World Wide Web Conference.
Speakers confirmed
Tim Berners-Lee: Tim is the well known inventor of the Web,
Ian Foster: Ian is the pioneer of the Grid, the next generation internet.
But as for semantic networks, what you as a human can see (“understand” really) is deceptive, so
let us obfuscate the document to confuse your “semantic processor”. This gives an impression of
what the computer “sees”.
R⌉}⟩∫⊔⌉∇⌉⌈√⊣∇⊔⟩⌋⟩√⊣\⊔∫⌋≀⇕⟩\}{∇≀⇕
A⊓∫⊔∇⊣↕⟩⊣⇔C⊣\⊣⌈⊣⇔C⟨⟩↕⌉D⌉\⇕⊣∇∥⇔F∇⊣\⌋⌉⇔G⌉∇⇕⊣\†⇔G⟨⊣\⊣⇔H≀\}K≀\}⇔I\⌈⟩⊣⇔
I∇⌉↕⊣\⌈⇔I⊔⊣↕†⇔J⊣√⊣\⇔M⊣↕⊔⊣⇔N⌉⊒Z⌉⊣↕⊣\⌈⇔T⟨⌉N⌉⊔⟨⌉∇↕⊣\⌈∫⇔N≀∇⊒⊣†⇔
S⟩\}⊣√≀∇⌉⇔S⊒⟩⊔‡⌉∇↕⊣\⌈⇔⊔⟨⌉U\⟩⊔⌉⌈K⟩\}⌈≀⇕⇔⊔⟨⌉U\⟩⊔⌉⌈S⊔⊣⊔⌉∫⇔V⟩⌉⊔\⊣⇕⇔Z⊣⟩∇⌉
O\⊔⟨⌉7⊔⟨M⊣†H≀\≀↕⊓↕⊓⊒⟩↕↕√∇≀⊑⟩⌈⌉⊔⟨⌉⌊⊣⌋∥⌈∇≀√≀{⊔⟨⌉⌉↕⌉⊑⌉\⊔⟨
I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉↙
S√⌉⊣∥⌉∇∫⌋≀\{⟩∇⇕⌉⌈
T⟩⇕B⌉∇\⌉∇∫↖L⌉⌉¬T⟩⇕⟩∫⊔⟨⌉⊒⌉↕↕∥\≀⊒\⟩\⊑⌉\⊔≀∇≀{⊔⟨⌉W⌉⌊⇔
I⊣\F≀∫⊔⌉∇¬I⊣\⟩∫⊔⟨⌉√⟩≀\⌉⌉∇≀{⊔⟨⌉G∇⟩⌈⇔⊔⟨⌉\⌉§⊔}⌉\⌉∇⊣⊔⟩≀\⟩\⊔⌉∇\⌉⊔↙
Obviously, there is not much the computer understands, and as a consequence, there is not a lot
the computer can support the reader with. So we have to “help” the computer by providing some
meaning. Conventional wisdom is that we add some semantic/functional markup. Here we pick
XML without loss of generality, and characterize some fragments of text e.g. as dates.
ℜ⊔⟩⊔↕⌉⊤WWW∈′′∈
T⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉ℜ∝⊔⟩⊔↕⌉⊤
ℜ√↕⊣⌋⌉⊤S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USAℜ∝√↕⊣⌋⌉⊤
ℜ⌈⊣⊔⌉⊤7↖∞∞M⊣†∈′′∈ℜ∝⌈⊣⊔⌉⊤
parse 7∞∞M⊣†∈′′∈ as the date 7–11 May 2002 and add this to the user’s calendar,
parse S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USA as a destination and find flights.
But: do not be deceived by your ability to understand English!
To understand what a machine can understand we have to obfuscate the markup as well, since it
does not carry any intrinsic meaning to the machine either.
<√↕⊣⌋⌉>S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USA</√↕⊣⌋⌉>
<⌈⊣⊔⌉>7↖∞∞M⊣†∈′′∈</⌈⊣⊔⌉>
<√⊣∇⊔⟩⌋⟩√⊣\⊔∫ >R⌉}⟩∫⊔⌉∇⌉⌈√⊣∇⊔⟩⌋⟩√⊣\⊔∫⌋≀⇕⟩\}{∇≀⇕
A⊓∫⊔∇⊣↕⟩⊣⇔C⊣\⊣⌈⊣⇔C⟨⟩↕⌉D⌉\⇕⊣∇∥⇔F∇⊣\⌋⌉⇔G⌉∇⇕⊣\†⇔G⟨⊣\⊣⇔H≀\}K≀\}⇔I\⌈⟩⊣⇔
I∇⌉↕⊣\⌈⇔I⊔⊣↕†⇔J⊣√⊣\⇔M⊣↕⊔⊣⇔N⌉⊒Z⌉⊣↕⊣\⌈⇔T⟨⌉N⌉⊔⟨⌉∇↕⊣\⌈∫⇔N≀∇⊒⊣†⇔
S⟩\}⊣√≀∇⌉⇔S⊒⟩⊔‡⌉∇↕⊣\⌈⇔⊔⟨⌉U\⟩⊔⌉⌈K⟩\}⌈≀⇕⇔⊔⟨⌉U\⟩⊔⌉⌈S⊔⊣⊔⌉∫⇔V⟩⌉⊔\⊣⇕⇔Z⊣⟩∇⌉
</√⊣∇⊔⟩⌋⟩√⊣\⊔∫ >
<⟩\⊔∇≀⌈⊓⌋⊔⟩≀\>O\⊔⟨⌉7⊔⟨M⊣†H≀\≀↕⊓↕⊓⊒⟩↕↕√∇≀⊑⟩⌈⌉⊔⟨⌉⌊⊣⌋∥⌈∇≀√≀{⊔⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣↖
⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉↙</⟩\⊔∇≀⌈⊓⌋⊔⟩≀\>
<√∇≀}∇⊣⇕>S√⌉⊣∥⌉∇∫⌋≀\{⟩∇⇕⌉⌈
<∫√⌉⊣∥⌉∇>T⟩⇕B⌉∇\⌉∇∫↖L⌉⌉¬T⟩⇕⟩∫⊔⟨⌉⊒⌉↕↕∥\≀⊒\⟩\⊑⌉\⊔≀∇≀{⊔⟨⌉W⌉⌊</∫√⌉⊣∥⌉∇>
<∫√⌉⊣∥⌉∇>I⊣\F≀∫⊔⌉∇¬I⊣\⟩∫⊔⟨⌉√⟩≀\⌉⌉∇≀{⊔⟨⌉G∇⟩⌈⇔⊔⟨⌉\⌉§⊔}⌉\⌉∇⊣⊔⟩≀\⟩\⊔⌉∇\⌉⊔<∫√⌉⊣∥⌉∇>
</√∇≀}∇⊣⇕>
So we have not really gained much with the markup either; we still have to give meaning to the
markup itself. This is where semantic web techniques come into play.
To understand how we can make the web more semantic, let us first take stock of the current status
of (markup on) the web. It is well-known that the world wide web is a hypertext, where multimedia
documents (text, images, videos, etc. and their fragments) are connected by hyperlinks. As we
have seen, all of these are largely opaque (non-understandable) to machines, so we end up with the following
situation (from the viewpoint of a machine).
Essentially, to make the web more machine-processable, we need to classify the resources by the
concepts they represent and give the links a meaning in a way that allows us to do inference with them.
The ideas presented here gave rise to a set of technologies jointly called the “semantic web”, which
we will now summarize before we return to our logical investigations of knowledge representation
techniques.
Example 18.1.24 (A script: getting your hair cut at a beauty parlor).
tell receptionist you’re here
beautician cuts hair
pay
if happy: big tip; if unhappy: small tip
Scripts provide: props and actors as “script variables”, events in a (generalized) sequence,
material for anaphora and bridging references, and a default common ground to fill missing
material into situations.
But of course logic-based approaches have big drawbacks as well. The first is that we have to obtain
the symbolic representations of knowledge to do anything – a non-trivial challenge, since most
knowledge does not exist in this form in the wild; to obtain it, some agent has to experience the
world, pass it through its cognitive apparatus, conceptualize the phenomena involved, systematize
them sufficiently to form symbols, and then represent those in the respective formalism at hand.
The second drawback is that the process models induced by logic-based approaches (inference
with calculi) are quite intractable. We will see that all inferences can be played back to satisfiability
tests in the underlying logical system, which are exponential at best, and undecidable or even
incomplete at worst.
Therefore a major thrust in logic-based knowledge representation is to investigate logical sys-
tems that are expressive enough to be able to represent most knowledge, but still have a decidable
– and maybe even tractable in practice – satisfiability problem. Such logics are called “description
logics”. We will study the basics of such logical systems and their inference procedures in the
following.
L ::= C | ⊤ | ⊥ | ¬L | L ⊓ L | L ⊔ L | L ⊑ L | L ≡ L
Note: ⟨PL0DL , S, [ ·]]⟩, where S is the class of possible domains forms a logical
system.
The main use of the set-theoretic semantics for PL0 is that we can use it to give meaning to concept
axioms, which we use to describe the “world”.
Concept Axioms
Observation: Set-theoretic semantics of ‘true’ and ‘false’: ⊤ := φ ⊔ ¬φ, ⊥ := φ ⊓ ¬φ.
Idea: Use logical axioms to describe the world (Axioms restrict the class of
admissible domain structures)
[Venn diagram: the set of children as the union of the disjoint sets of sons and daughters]
Concept axioms are used to restrict the set of admissible domains to the intended ones. In our
situation, we require them to be true – as usual – which here means that they denote the whole
domain D.
Let us fortify our intuition about concept axioms with a simple example about the sibling relation.
We give four concept axioms and study their effect on the admissible models by looking at the
respective Venn diagrams. In the end we see that in all admissible models, the denotations of the
concepts son and daughter are disjoint, and child is the union of the two – just as intended.
Axioms and their semantics:
son ⊑ child        is true   iff [[¬son]] ∪ [[child]] = D        iff [[son]] ⊆ [[child]]
daughter ⊑ child   is true   iff [[¬daughter]] ∪ [[child]] = D   iff [[daughter]] ⊆ [[child]]
[Venn diagrams: the sets of sons and daughters contained in the set of children]
The set-theoretic semantics introduced above is compatible with the regular semantics of proposi-
tional logic, therefore we have the same propositional identities. Their validity can be established
directly from the settings in Definition 18.2.2.
Propositional Identities
Name       for ⊓                                  for ⊔
Idempot.   φ ⊓ φ = φ                              φ ⊔ φ = φ
Identity   φ ⊓ ⊤ = φ                              φ ⊔ ⊥ = φ
Absorpt.   φ ⊔ ⊤ = ⊤                              φ ⊓ ⊥ = ⊥
Commut.    φ ⊓ ψ = ψ ⊓ φ                          φ ⊔ ψ = ψ ⊔ φ
Assoc.     φ ⊓ (ψ ⊓ θ) = (φ ⊓ ψ) ⊓ θ              φ ⊔ (ψ ⊔ θ) = (φ ⊔ ψ) ⊔ θ
Distrib.   φ ⊓ (ψ ⊔ θ) = (φ ⊓ ψ) ⊔ (φ ⊓ θ)        φ ⊔ (ψ ⊓ θ) = (φ ⊔ ψ) ⊓ (φ ⊔ θ)
Absorpt.   φ ⊓ (φ ⊔ θ) = φ                        φ ⊔ (φ ⊓ θ) = φ
Morgan     ¬(φ ⊓ ψ) = ¬φ ⊔ ¬ψ                     ¬(φ ⊔ ψ) = ¬φ ⊓ ¬ψ
dneg       ¬¬φ = φ
There is another way we can approach the set description interpretation of propositional logic: by
translation into a logic that can express knowledge about sets – first-order logic.
Definition                                    Comment
p^fo(x) := p(x)
(¬A)^fo(x) := ¬(A^fo(x))
(A ⊓ B)^fo(x) := A^fo(x) ∧ B^fo(x)             ∧ vs. ⊓
(A ⊔ B)^fo(x) := A^fo(x) ∨ B^fo(x)             ∨ vs. ⊔
(A ⊑ B)^fo(x) := A^fo(x) ⇒ B^fo(x)             ⇒ vs. ⊑
(A = B)^fo(x) := A^fo(x) ⇔ B^fo(x)             ⇔ vs. =
A^fo := ∀x A^fo(x)                             for formulae
Translation Examples
Example 18.2.8. We translate the concept axioms from Example 18.2.6 to fortify
our intuition:
(son ⊑ child)^fo = ∀x son(x) ⇒ child(x)
(daughter ⊑ child)^fo = ∀x daughter(x) ⇒ child(x)
(¬(son ⊓ daughter))^fo = ∀x ¬(son(x) ∧ daughter(x))
(child ⊑ son ⊔ daughter)^fo = ∀x child(x) ⇒ (son(x) ∨ daughter(x))
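The translation ·^fo is simple enough to implement directly; the following Python sketch produces the first-order formulae (as strings) from the table above. The concept representation and names are assumptions of this sketch.

# The fo-translation of PL0_DL concepts and axioms into first-order formulae.
# Concepts are nested tuples ("not", C), ("and", C, D), ("or", C, D),
# ("sub", C, D), ("eq", C, D), or atomic concept names (strings).
def fo(c, x="x"):
    if isinstance(c, str):                # atomic concept p becomes p(x)
        return c + "(" + x + ")"
    op = c[0]
    if op == "not":
        return "¬" + fo(c[1], x)
    conn = {"and": "∧", "or": "∨", "sub": "⇒", "eq": "⇔"}[op]
    return "(" + fo(c[1], x) + " " + conn + " " + fo(c[2], x) + ")"

def fo_formula(c):
    return "∀x " + fo(c)                  # formulae are universally closed

print(fo_formula(("sub", "son", "child")))
# ∀x (son(x) ⇒ child(x))
print(fo_formula(("sub", "child", ("or", "son", "daughter"))))
# ∀x (child(x) ⇒ (son(x) ∨ daughter(x)))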
As we will see, the situation for PL0DL is typical for formal ontologies (even though it only offers
concepts), so we state the general description logic paradigm for ontologies. The important idea
is that having a formal system as an ontology format allows us to capture, study, and implement
ontological inference.
For convenience we add concept definitions as a mechanism for defining new concepts from old
ones. The so-defined concepts inherit the properties from the concepts they are defined from.
As PL0DL does not offer any guidance on this, we will leave the discussion of ABoxes to subsec-
tion 18.3.3 when we have introduced our first proper description logic ALC.
Consistency Test
Example 18.2.24 (T-Box).
Even though consistency in our example seems trivial, large ontologies can make machine support
necessary. This is even more true for ontologies that change over time. Say that an ontology
initially has the concept definitions woman=person⊓long_hair and man=person⊓bearded, and then
is modernized to a more biologically correct state. In the initial version the concept hermaphrodite
is consistent, but becomes inconsistent after the renovation; the authors of the renovation should
be made aware of this by the system.
The subsumption test determines whether the sets denoted by two concepts are in a subset relation.
The main justification for this is that humans tend to be aware of concept subsumption and to
think in taxonomic hierarchies. To cater to this, the subsumption test is useful.
Subsumption Test
Example 18.2.25. in this case trivial
The good news is that we can reduce the subsumption test to the consistency test, so we can
re-use our existing implementation.
The main user-visible service of the subsumption test is to compute the actual taxonomy induced
by an ontology.
Classification
The subsumption relation among all concepts (subsumption graph)
Visualization of the subsumption graph for inspection (plausibility)
Definition 18.2.27. Classification is the computation of the subsumption graph.
If we take stock of what we have developed so far, then we can see PL0DL as a rational recon-
struction of semantic networks restricted to the “isa” relation. We relegate the “instance” relation
to subsection 18.3.3.
This reconstruction can now be used as a basis on which we can extend the expressivity and
inference procedures without running into problems.
ALC extends the concept operators of PL0DL with binary relations (called “roles” in ALC). This
gives ALC the expressive power we had for the basic semantic networks from ??.
Syntax of ALC
Example 18.3.4. person, woman, man, mother, professor, student, car, BMW,
computer, computer program, heart attack risk, furniture, table, leg of a chair, . . .
Definition 18.3.5. Roles name binary relations (like in PL1 )
Example 18.3.6. has_child, has_son, has_daughter, loves, hates, gives_course,
executes_computer_program, has_leg_of_table, has_wheel, has_motor, . . .
ALC restricts quantification to range over the individuals reachable as role successors. The
distinction between universal and existential quantifiers clarifies an implicit ambiguity in semantic
networks.
Definition 18.3.7 (Grammar). FALC ::= C | ⊤ | ⊥ | ¬FALC | FALC ⊓ FALC | FALC ⊔ FALC |
∃R FALC | ∀R FALC
Example 18.3.8.
Example 18.3.9. car ⊓ ∃has_part ¬(∃made_in EU) (cars that have at least one part
that has not been made in the EU)
As before we allow concept definitions so that we can express new concepts from old ones, and
obtain more concise descriptions.
Definition                                                            recursive?
man = person ⊓ ∃has_chrom Y_chrom                                     -
woman = person ⊓ ∀has_chrom ¬Y_chrom                                  -
mother = woman ⊓ ∃has_child person                                    -
father = man ⊓ ∃has_child person                                      -
grandparent = person ⊓ ∃has_child (mother ⊔ father)                   -
german = person ⊓ ∃has_parents german                                 +
number_list = empty_list ⊔ (∃is_first number ⊓ ∃is_rest number_list)  +
As before, we can normalize a TBox by definition expansion if it is acyclic. With the introduction
of roles and quantification, concept definitions in ALC have a more “interesting” way to be cyclic
as Observation 18.3.19 shows.
Example 18.3.20.
Now that we have motivated and fixed the syntax of ALC, we will give it a formal semantics.
The semantics of ALC is an extension of the set-theoretic semantics for PL0 , thus the interpretation
[[·]] assigns subsets of the domain to concepts and binary relations over the domain to roles.
Semantics of ALC
ALC semantics is an extension of the set-semantics of propositional logic.
Definition 18.3.21. A model for ALC is a pair ⟨D, [[·]]⟩, where D is a non-empty
set called the domain and [[·]] a mapping called the interpretation, such that
We can now use the ALC identities above to establish a useful normal form for ALC. This will
play a role in the inference procedures we study next.
The following identities will be useful later on. They can be proven directly from the settings in
Definition 18.3.21; we carry this out for one of them below.
ALC Identities
1. ¬(∃R φ) = ∀R ¬φ
2. ∀R (φ ⊓ ψ) = ∀R φ ⊓ ∀R ψ
3. ¬(∀R φ) = ∃R ¬φ
4. ∃R (φ ⊔ ψ) = ∃R φ ⊔ ∃R ψ
Proof of 1:
[[¬(∃R φ)]] = D \ [[∃R φ]] = D \ {x ∈ D | ∃y (⟨x, y⟩ ∈ [[R]] and y ∈ [[φ]])}
= {x ∈ D | not ∃y (⟨x, y⟩ ∈ [[R]] and y ∈ [[φ]])}
= {x ∈ D | ∀y if ⟨x, y⟩ ∈ [[R]] then y ∉ [[φ]]}
= {x ∈ D | ∀y if ⟨x, y⟩ ∈ [[R]] then y ∈ (D \ [[φ]])}
= {x ∈ D | ∀y if ⟨x, y⟩ ∈ [[R]] then y ∈ [[¬φ]]}
= [[∀R ¬φ]]
The form of the identities (interchanging quantification with connectives) is reminiscent of identities
in PL1; this is no coincidence, as the “semantics by translation” of Definition 18.3.22 shows.
Use the ALC identities as rewrite rules to compute it. (in linear time)
example                                       by rule
¬(∃R (∀S e ⊓ ¬(∀S d)))
  ↦ ∀R ¬(∀S e ⊓ ¬(∀S d))                      ¬(∃R φ) ↦ ∀R ¬φ
  ↦ ∀R (¬(∀S e) ⊔ ¬¬(∀S d))                   ¬(φ ⊓ ψ) ↦ ¬φ ⊔ ¬ψ
  ↦ ∀R (∃S ¬e ⊔ ¬¬(∀S d))                     ¬(∀S φ) ↦ ∃S ¬φ
  ↦ ∀R (∃S ¬e ⊔ ∀S d)                         ¬¬φ ↦ φ
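The rewrite system above is easy to turn into a program; here is a Python sketch that computes the negation normal form of an ALC concept in one linear pass. Concepts are nested tuples, the example at the end mirrors the (reconstructed) derivation above, and all names are assumptions of this sketch.

# Negation normal form for ALC concepts, pushing ¬ inward with the identities.
# Concepts: ("not", C), ("and", C, D), ("or", C, D),
#           ("forall", R, C), ("exists", R, C), or an atomic concept name.
def nnf(c):
    if isinstance(c, str):
        return c
    op = c[0]
    if op in ("and", "or"):
        return (op, nnf(c[1]), nnf(c[2]))
    if op in ("forall", "exists"):
        return (op, c[1], nnf(c[2]))
    d = c[1]                                         # c = ("not", d)
    if isinstance(d, str):
        return ("not", d)                            # negated atom: already NNF
    if d[0] == "not":
        return nnf(d[1])                             # ¬¬φ ↦ φ
    if d[0] == "and":
        return ("or", nnf(("not", d[1])), nnf(("not", d[2])))
    if d[0] == "or":
        return ("and", nnf(("not", d[1])), nnf(("not", d[2])))
    if d[0] == "exists":                             # ¬∃R φ ↦ ∀R ¬φ
        return ("forall", d[1], nnf(("not", d[2])))
    return ("exists", d[1], nnf(("not", d[2])))      # ¬∀R φ ↦ ∃R ¬φ

print(nnf(("not", ("exists", "R", ("and", ("forall", "S", "e"),
                                          ("not", ("forall", "S", "d")))))))
# ('forall', 'R', ('or', ('exists', 'S', ('not', 'e')), ('forall', 'S', 'd')))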
Finally, we extend ALC with an ABox component. This mainly means that we define two new
assertions in ALC and specify their semantics and PL1 translation.
If we take stock of what we have developed so far, then we see that ALC is a rational reconstruction
of semantic networks restricted to the “isa” and “instance” relations – which are the only
ones that can really be given a denotational and operational semantics.
where φ is a normalized ALC concept in negation normal form, with the following rules:
T⊥: from x:c and x:¬c derive ⊥
T⊓: from x:φ ⊓ ψ derive x:φ and x:ψ
T⊔: from x:φ ⊔ ψ branch into x:φ or x:ψ
T∀: from x:∀R φ and x R y derive y:φ
T∃: from x:∃R φ derive x R y and y:φ for a new individual y
In contrast to the tableau calculi for theorem proving we have studied earlier, TALC is run in “model
generation mode”. Instead of initializing the tableau with the axioms and the negated conjecture
and hoping that all branches will close, we initialize the TALC tableau with axioms and the “conjecture”
that a given concept φ is satisfiable – i.e. that φ has a member x – and hope for branches that are
open, i.e. that make the conjecture true (and at the same time give a model).
Let us now work through two very simple examples; one unsatisfiable, and a satisfiable one.
TALC Examples
Example 18.3.29. We have two similar conjectures about children.
x : ∀has_child man ⊓ ∃has_child ¬man   (all sons, but a daughter)
x : ∀has_child man ⊓ ∃has_child man    (only sons, and at least one)
Example 18.3.30 (Tableau Proof).
Left conjecture (inconsistent):
1  x:∀has_child man ⊓ ∃has_child ¬man   initial
2  x:∀has_child man                     T⊓
3  x:∃has_child ¬man                    T⊓
4  x has_child y                        T∃
5  y:¬man                               T∃
6  y:man                                T∀
7  ⊥                                    T⊥
Right conjecture (open):
1  x:∀has_child man ⊓ ∃has_child man    initial
2  x:∀has_child man                     T⊓
3  x:∃has_child man                     T⊓
4  x has_child y                        T∃
5  y:man                                T∃
   open
The right tableau has a model: there are two persons, x and y; y is the only child of x, and y is a man.
Another example: this one is more complex, but the concept is satisfiable.
7 y:ugrad y:grad T⊔
8 ⊥ open
The left branch is closed, the right one represents a model: y is a child of x, y
is a graduate student, x hat exactly one child: y.
After we got an intuition about TALC , we can now study the properties of the calculus to determine
that it is a decision procedure for ALC.
The correctness result for TALC is as usual: we start with a model of x:φ and show that a TALC
tableau must have an open branch.
Correctness
Lemma 18.3.32. If φ is satisfiable, then TALC terminates on x:φ with an open branch.
Proof: Let M := ⟨D, [[·]]⟩ be a model for φ and w ∈ [[φ]].
1. We define [[x]] := w and extend satisfaction to judgements: I |= (x:ψ) iff [[x]] ∈ [[ψ]],
   I |= x R y iff ⟨[[x]], [[y]]⟩ ∈ [[R]], and I |= S iff I |= c for all c ∈ S.
2. This gives us M |= (x:φ). (base case)
3. If the branch is satisfiable, then either no rule is applicable to the leaf (open branch),
   or a rule is applicable and one of the new branches is satisfiable. (inductive case)
4. There must be an open branch. (by termination)
We complete the proof by looking at all the TALC inference rules in turn.
For the completeness result for TALC we have to start with an open tableau branch and construct a
model that satisfies all judgements in the branch. We proceed by building a Herbrand model, whose
domain consists of all the individuals mentioned in the branch and which interprets all concepts
and roles as specified in the branch. Not surprisingly, the model thus constructed satisfies the
branch.
D := {x | x:ψ ∈ B or z R x ∈ B}
[[c]] := {x | x:c ∈ B}
[[R]] := {⟨x, y⟩ | x R y ∈ B}
We complete the proof by looking at all the TALC inference rules in turn.
case y:ψ = y:(ψ1 ⊓ ψ2): Then {y:ψ1, y:ψ2} ⊆ B (T⊓ rule, saturation), so M |= (y:ψ1)
and M |= (y:ψ2), and thus M |= (y:ψ1 ⊓ ψ2). (IH, Definition)
case y:ψ = y:(ψ1 ⊔ ψ2): Then y:ψ1 ∈ B or y:ψ2 ∈ B (T⊔, saturation), so M |= (y:ψ1) or
M |= (y:ψ2), and thus M |= (y:ψ1 ⊔ ψ2). (IH, Definition)
case y:ψ = y:∃R θ: Then {y R z, z:θ} ⊆ B for a new variable z (T∃ rule, saturation), so
M |= (z:θ) and M |= y R z, thus M |= (y:∃R θ). (IH, Definition)
case y:ψ = y:∀R θ: Let ⟨[[y]], v⟩ ∈ [[R]] for some v ∈ D,
then v = z for some variable z with y R z ∈ B (construction of [[R]]). So z:θ ∈ B and
M |= (z:θ) (T∀ rule, saturation, Def). Since v was arbitrary, we have M |= (y:∀R θ).
Termination
Theorem 18.3.34. TALC terminates
To prove termination of a tableau algorithm, find a well-founded measure (function)
We can turn the termination result into a worst-case complexity result by examining the sizes of
branches.
Complexity
Idea: Work off tableau branches one after the other. (branch size =b space complexity)
Observation 18.3.35. The size of the branches is polynomial in the size of the
input formula:
Proof sketch: Re-examine the termination proof and count: the first summand
comes from Proof step 4., the second one from Proof step 3. and Proof step 2.
Theorem 18.3.36. The satisfiability problem for ALC is in PSPACE.
Theorem 18.3.37. The satisfiability problem for ALC is PSPACE-Complete.
Proof sketch: Reduce a PSPACE-complete problem to ALC-satisfiability
In summary, the theoretical complexity of ALC is the same as that for PL0 , but in practice ALC is
much more expressive. So this is a clear win.
But the description of the tableau algorithm TALC is still quite abstract, so we look at an
exemplary implementation in a functional programming language.
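The functional implementation from the course slides is not reproduced in these notes; as a substitute, here is a Python sketch of TALC in model generation mode for an empty TBox: it saturates a set of judgements with the rules above and reports whether an open (clash-free) branch exists. The concept representation follows the NNF sketch above, and all other names are our own.

# Satisfiability of an ALC concept (in negation normal form, empty TBox)
# by saturating tableau branches.  A branch is a set of judgements
# ("in", x, concept) and ("role", x, R, y).
import itertools

def consistent(branch, counter=None):
    counter = counter or itertools.count(1)
    branch = set(branch)
    while True:
        new = set()
        for j in list(branch):
            if j[0] != "in":
                continue
            _, x, c = j
            if isinstance(c, str):
                if ("in", x, ("not", c)) in branch:
                    return False                      # T_bot: clash, branch closes
            elif c[0] == "and":                       # T_and
                new |= {("in", x, c[1]), ("in", x, c[2])}
            elif c[0] == "or":                        # T_or: explore both branches
                if not any(("in", x, d) in branch for d in c[1:]):
                    return any(consistent(branch | {("in", x, d)}, counter)
                               for d in c[1:])
            elif c[0] == "forall":                    # T_forall
                new |= {("in", r[3], c[2]) for r in branch
                        if r[0] == "role" and r[1:3] == (x, c[1])}
            elif c[0] == "exists":                    # T_exists: fresh individual
                if not any(r[0] == "role" and r[1:3] == (x, c[1])
                           and ("in", r[3], c[2]) in branch for r in branch):
                    y = "y" + str(next(counter))
                    new |= {("role", x, c[1], y), ("in", y, c[2])}
        if new <= branch:
            return True                               # saturated and open: a model
        branch |= new

# the two conjectures from Example 18.3.29:
print(consistent({("in", "x", ("and", ("forall", "has_child", "man"),
                                      ("exists", "has_child", ("not", "man"))))}))  # False
print(consistent({("in", "x", ("and", ("forall", "has_child", "man"),
                                      ("exists", "has_child", "man")))}))           # True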
Note that we have (so far) only considered an empty TBox: we have initialized the tableau
with a normalized concept; so we did not need to include the concept definitions. To cover “real”
ontologies, we need to consider the case of concept axioms as well.
We now extend TALC with concept axioms. The key idea here is to realize that the concept axioms
apply to all individuals. As the individuals are generated by the T∃ rule, we can simply extend
that rule to apply all the concept axioms to the newly introduced individual.
The problem with this approach is that it spoils termination, since we can no longer bound the number of
rule applications by (fixed) properties of the input formulae. The example shows this very nicely.
x:d        start
x:∃R c     in CA
x R y1     T∃
y1:c       T∃
y1:∃R c    T∃^CA
y1 R y2    T∃
y2:c       T∃
y2:∃R c    T∃^CA
...
Solution: Loop check: instead of a new variable y take an old variable z, if we can guarantee
that whatever holds for y already holds for z. We can only do this iff the T∀ rule has been
exhaustively applied.
Theorem 18.3.40. The consistency problem of ALC with concept axioms is decid-
able.
Proof sketch: TALC with a suitable loop check terminates.
If we combine classification with the instance test, then we get the full picture of how concepts
and individuals relate to each other. We see that we get the full expressivity of semantic networks
in ALC.
Realization
Definition 18.3.43. Realization is the computation of all instance relations be-
tween ABox objects and TBox concepts.
Let us now get an intuition for the kinds of interactions that can occur between the various parts of an ontology.
property                                          example
internally inconsistent ABox                      tony:student, tony:¬student
inconsistent with a TBox                          TBox: ¬(student ⊓ prof); ABox: tony:student, tony:prof
implicit info that is not explicit                ABox: tony:∀has_grad genius, tony has_grad mary  |=  mary:genius
information that can be combined with TBox info   TBox: happy_prof = prof ⊓ ∀has_grad genius; ABox: tony:happy_prof, tony has_grad mary  |=  mary:genius
This completes our investigation of inference for ALC. We summarize that ALC is a logic-based
ontology language where the inference problems are all decidable/computable via TALC . But of
course, while we have reached the expressivity of basic semantic networks, there are still things
that we cannot express in ALC, so we will try to extend ALC without losing decidability/com-
putability.
Note that all these examples have in common that they are about “objects on the Web”, which is
an aspect we will come to now.
“Objects on the Web” are traditionally called “resources”, rather than defining them by their
intrinsic properties – which would be ambitious and prone to change – we take an external property
to define them: everything that has a URI is a web resource. This has repercussions on the design
of RDF.
The crucial observation here is that if we map “subjects” and “objects” to “individuals”, and
“predicates” to “relations”, the RDF triples are just relational ABox statements of description
logics. As a consequence, the techniques we developed apply.
Note: Actually, an RDF graph is technically a labeled multigraph, which allows multiple edges between
any two nodes (the resources) and where nodes and edges are labeled by URIs.
We now come to the concrete syntax of RDF. This is a relatively conventional XML syntax that
combines RDF statements with a common subject into a single “description” of that resource.
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22−rdf−syntax−ns#"
xmlns:dc= "http://purl.org/dc/elements/1.1/">
<rdf:Description about="https://.../CompLog/kr/en/rdf.tex">
<dc:creator>Michael Kohlhase</dc:creator>
<dc:source>http://www.w3schools.com/rdf</dc:source>
</rdf:Description>
</rdf:RDF>
Note that XML namespaces play a crucial role in using elements to encode the predicate URIs.
Recall that an element name is a qualified name that consists of a namespace URI and a proper
element name (without a colon character). Concatenating them gives a URI; in our example the
predicate URI induced by the dc:creator element is http://purl.org/dc/elements/1.1/creator.
Note that as URIs go RDF URIs do not have to be URLs, but this one is and it references (is
redirected to) the relevant part of the Dublin Core elements specification [DCM12].
RDF was deliberately designed as a standoff markup format, where URIs are used to annotate
web resources by pointing to them, so that it can be used to give information about web resources
without having to change them. But this also creates maintenance problems, since web resources
may change or be deleted without warning.
RDFa gives authors a way to embed RDF triples into web resources themselves, which makes it
easier to keep the RDF statements about them in sync.
[RDF triples encoded in the RDFa example: the resource https://svn.kwarc.info/.../CompLog/kr/slides/rdfa.tex has http://purl.org/dc/elements/1.1/title “RDFa as an Inline RDF Markup Format”, http://purl.org/dc/elements/1.1/date 2009−11−11 (xsd:date), and http://purl.org/dc/elements/1.1/creator Michael Kohlhase]
In the example above, the about and property attributes are reserved by RDFa and specify the
subject and predicate of the RDF statement. The object consists of the body of the element,
unless otherwise specified e.g. by the resource attribute.
Let us now come back to the fact that RDF is just an XML syntax for ABox statements.
In this situation, we want a standardized representation language for TBox information; OWL
does just that: it standardizes a set of knowledge representation primitives and specifies a variety
of concrete syntaxes for them. OWL is designed to be compatible with RDF, so that the two
together can form an ontology language for the web.
Definition 18.4.10. OWL (the Web Ontology Language) is a language for encoding
TBox information about RDF classes.
Example 18.4.11 (A concept definition for “Mother”).
Mother=Woman ⊓ Parent is represented as
But there are also other syntaxes in regular use. We show the functional syntax which is inspired
by the mathematical notation of relations.
We have introduced the ideas behind using description logics as the basis of a “machine-oriented
web of data”. While the first OWL specification (2004) had three sublanguages “OWL Lite”, “OWL
DL” and “OWL Full”, of which only the middle was based on description logics, with the OWL2
Recommendation from 2009, the foundation in description logics was nearly universally accepted.
The semantic web hype is by now nearly over; the technology has reached the “plateau of
productivity” with many applications being pursued in academia and industry. We will not go
into these, but briefly introduce one of the tools that make this work.
Example 18.4.15.
Query for person names and their e-mails from a triplestore with FOAF data.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
?person a foaf:Person.
?person foaf:name ?name.
?person foaf:mbox ?email.
}
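Triplestores and RDF libraries expose SPARQL directly. As an illustration, the following Python sketch runs the query above over a local RDF file with the rdflib library; the file name friends.ttl and its FOAF content are assumptions of this sketch, not part of the course material.

# Running the SPARQL query above in-memory with rdflib (pip install rdflib).
from rdflib import Graph

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
  ?person foaf:mbox ?email .
}
"""

g = Graph()
g.parse("friends.ttl", format="turtle")   # load ABox triples (assumed local file)
for row in g.query(query):                # evaluate the SPARQL query
    print(row.name, row.email)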
SPARQL endpoints can be used to build interesting applications, if fed with the appropriate data.
An interesting – and by now paradigmatic – example is the DBpedia project, which builds a large
ontology by analyzing Wikipedia fact boxes. These are in a standard HTML form which can be
analyzed e.g. by regular expressions, and their entries are essentially already in triple form: the
subject is the Wikipedia page they are on, the predicate is the key, and the object is either the
URI of the object value (if it carries a link) or the value itself.
We conclude our survey of the semantic web technology stack with the notion of a triplestore,
the database component that stores vast collections of ABox triples.
This part covers the AI subfield of “planning”, i.e. search-based problem solving with a
structured representation language for environment state and actions — in planning, the focus is
on the latter.
We first introduce the framework of planning (structured representation languages for problems
and actions) and then present algorithms and complexity results. Finally, we lift some of the
simplifying assumptions – deterministic, fully observable environments – we made in the previous
parts of the course.
Chapter 19
Planning I: Framework
Planning
Ambition: Write one program that can solve all classical search problems.
Idea: For CSP, going from “state/action-level search” to “problem-description
level search” did the trick.
Definition 19.0.2. Let Π be a search problem (see chapter 8)
Use inference systems to deduce new world knowledge from percepts and actions.
Problem: Representing (changing) percepts immediately leads to contradictions!
Example 19.1.1. The agent moves, and a cell with a draft (a perceived breeze)
is followed by one without.
Let us recall the agent-based setting we were using for the inference procedures from Part III. We
will elaborate this further in this section.
[figure: a model-based reflex agent (cf. AIMA Figure 2.12) – sensors tell the agent what the world is like now, an internal state together with models of how the world evolves and what the agent’s actions do is used to choose an action, which the actuators execute in the environment]
Still Unspecified: (up next)
MAKE−PERCEPT−SENTENCE: the effects of percepts.
MAKE−ACTION−QUERY: what is the best next action?
MAKE−ACTION−SENTENCE: the effects of that action.
In particular, we will look at the effect of time/change. (neglected so far)
Now that we have the notion of fluents to represent the percepts at a given time point, let us try
to model how they influence the agent’s world model.
Axioms like these model the agent’s sensors – here that they are totally reliable:
there is a breeze, iff the agent feels a draft.
Definition 19.1.5. We call fluents that describe the agent’s sensors sensor axioms.
Problem: Where do fluents like Ag@(t, x, y) come from?
You may have noticed that for the sensor axioms we have only used first-order logic. There is a
general story to tell here: if we have finite domains (as we do in the Wumpus cave) we can always
“compile first-order logic” into propositional logic. We will develop this here before we go on with
the Wumpus models.
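A small Python sketch makes the “compilation” concrete: for a finite domain we can simply replace every quantifier by a finite conjunction or disjunction over all domain elements. The formula representation and the example axiom are assumptions of this sketch.

# Grounding first-order formulae over a finite domain: ∀x φ(x) becomes the
# conjunction of all instances, ∃x φ(x) the disjunction.  Formulae are
# nested tuples; anything that is not a quantifier or connective is an atom.
def ground(phi, domain):
    op = phi[0]
    if op in ("forall", "exists"):
        _, var, body = phi
        conn = "and" if op == "forall" else "or"
        return (conn,) + tuple(ground(substitute(body, var, d), domain)
                               for d in domain)
    if op in ("and", "or", "not", "implies"):
        return (op,) + tuple(ground(p, domain) for p in phi[1:])
    return phi                                   # ground atom

def substitute(phi, var, value):
    if phi == var:
        return value
    if isinstance(phi, tuple):
        return tuple(substitute(p, var, value) for p in phi)
    return phi

# e.g. a sensor axiom over the cells of a tiny 2x2 cave (coordinates assumed):
cells = [(1, 1), (1, 2), (2, 1), (2, 2)]
axiom = ("forall", "c", ("implies", ("Draft", "c"), ("Breeze", "c")))
print(ground(axiom, cells))   # a conjunction of four propositional implications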
We now continue to our logic-based agent models: Now we focus on effect axioms to model the
effects of an agent’s actions.
Definition 19.1.7. Effect axioms describe how the environment change under an
agent’s actions.
Example 19.1.8. If the agent is in cell [1, 1] facing east at time 0 and goes
forward, she is in cell [2, 1] and no longer in [1, 1]:
Unfortunately, the percept fluents, sensor axioms, and effect axioms are not enough, as we will
show in ??. We will see that this is an instance of a more general problem – the famous frame problem.
Note that OK and glitter are fluents, since the Wumpus might have died or the gold
might have been grabbed.
And finally the route planning part of the code. This is essentially just A∗ search.
Evaluation: Even though this works for the Wumpus world, it is not the “universal,
logic-based problem solver” we dreamed of!
Planning tries to solve this with another representation of actions. (up next)
[example workflow diagram: processing a CQ – create CQ, submit CQ, check CQ completeness and consistency, decide whether approval is necessary, check CQ approval status, mark CQ as accepted, create follow-up for CQ, archive CQ]
[figures: a network security scenario used as a planning domain – an attacker on the Internet, a router and firewall, a DMZ containing a web server and an application server, and an internal network with a workstation, a DB server, and sensitive users]
Quick: Rapid prototyping: 10s lines of problem description vs. 1000s lines of C++
code. (E.g. language generation)
Flexible: Adapt/maintain the description. (E.g. network security)
Intelligent: Determines automatically how to solve a complex problem effectively!
(The ultimate goal, no?!)
Efficiency loss: Without any domain-specific knowledge about chess, you don’t
beat Kasparov . . .
Trade-off between “automatic and general” vs. “manual work but effective”.
Research Question: How to make fully automatic algorithms effective?
Search Planning
States Lisp data structures Logical sentences
Actions Lisp code Preconditions/outcomes
Goal Lisp code Logical sentence (conjunction)
Plan Sequence from S0 Constraints on actions
n blocks, 1 hand.
A single action either takes a block with the hand or puts a
block we’re holding onto some other block/the table.
Observation 19.2.4. State spaces typically are huge even for simple problems.
In other words: Even solving “simple problems” automatically (without help from
a human) requires a form of intelligence.
With blind search, even the largest super computer in the world won’t scale beyond
20 blocks!
Focussing on heuristic search as the solution method, this is the main question
that needs to be answered.
SAT variables: at(A)0, at(B)0, move(A, B)0, move(A, C)0, at(A)1, at(B)1; clauses to encode the
transition behavior, e.g. at(B)1^F ∨ move(A, B)0^T; unit clauses to encode the initial state:
at(A)0^T, at(B)0^F; a unit clause to encode the goal: at(B)1^T.
Popular when: 1996 – today.
Approach: From planning task description, generate propositional CNF formula
φk that is satisfiable iff there exists a plan with k steps; use a SAT solver on φk ,
for different values of k.
Keywords/cites: [KS92; KS98; RHN06; Rin10], SAT encoding schemes, Black-
Box, . . .
Prerequisite/Result:
Standard representation language: PDDL [McD+98; FL03; HE05; Ger+09]
Problem Corpus: ≈ 50 domains, ≫ 1000 instances, 74 (!!) planners in 2011
y?
Answer: reserved for the plenary sessions ; be there!
Generally: reserved for the plenary sessions ; be there!
[figure: initial state with C on A and B on the table beside it; goal state with A on B and B on C]
Simple planners that split the goal into subgoals on(A, B) and on(B, C) fail:
STRIPS Planning
Definition 19.4.1. STRIPS = Stanford Research Institute Problem Solver.
STRIPS is the simplest possible (reasonably expressive) logic-based planning
language.
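The formal definition of STRIPS tasks follows below; as a preview, here is a Python sketch of the representation (states and goals as sets of facts, actions as precondition/add/delete triples) together with breadth-first forward search. The concrete facts and action names form our own toy instance in the spirit of the Australia example below, not the course's exact encoding.

# A STRIPS task as (initial state, goal, actions) and blind forward search.
from collections import deque

def forward_search(init, goal, actions):
    """Breadth-first search in the space of states; returns a plan or None."""
    start = frozenset(init)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                      # all goal facts hold
            return plan
        for name, (pre, add, delete) in actions.items():
            if pre <= state:                   # action applicable
                succ = frozenset((state - delete) | add)
                if succ not in seen:
                    seen.add(succ)
                    frontier.append((succ, plan + [name]))
    return None

actions = {
    "drv(Sy,Ad)": ({"at(Sy)"}, {"at(Ad)", "vis(Ad)"}, {"at(Sy)"}),
    "drv(Ad,Sy)": ({"at(Ad)"}, {"at(Sy)", "vis(Sy)"}, {"at(Ad)"}),
    "drv(Sy,Br)": ({"at(Sy)"}, {"at(Br)", "vis(Br)"}, {"at(Sy)"}),
}
print(forward_search({"at(Sy)", "vis(Sy)"}, {"vis(Ad)", "vis(Br)"}, actions))
# ['drv(Sy,Ad)', 'drv(Ad,Sy)', 'drv(Sy,Br)']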
“TSP” in Australia
Example 19.4.3 (Salesman Travelling in Australia).
[figure: the state space induced by the STRIPS encoding of the task, with facts at(·) and vis(·) for Sydney (Sy), Adelaide (Ad), and Brisbane (Br) and drive actions drv(·, ·) between the states]
Answer: Yes, two – plans for TSP− are solutions for ΘTSP−, dashed node =b I, thick nodes =b G:
drv(Sy, Br), drv(Br, Sy), drv(Sy, Ad) (upper path)
drv(Sy, Ad), drv(Ad, Sy), drv(Sy, Br). (lower path)
The Blocksworld
Definition 19.4.7. The blocks world is a simple planning domain: a set of wooden
blocks of various shapes and colors sit on a table. The goal is to build one or more
vertical stacks of blocks. Only one block may be moved at a time: it may either be
placed on the table or placed atop another block.
Example 19.4.8.
[figure: an example blocks world state and goal configuration built from the blocks A–E]
The next example for a planning problem is not obvious at first sight, but has been quite influential,
showing that many industry problems can be specified declaratively by formalizing the domain
and the particular planning problems in PDDL and then using off-the-shelf planners to solve them.
[KS00] reports that this has significantly reduced labor costs and increased maintainability of the
implementation.
[figure: an elevator scheduling scenario with passenger types VIP, NA (never-alone), AT (attendant), and P (normal passenger)]
[figure (repeated): initial state with C on A and B on the table beside it; goal state with A on B and B on C]
Simple planners that split the goal into subgoals on(A, B) and on(B, C) fail:
Before we go into the details, let us try to understand the main ideas of partial order planning.
We now make the ideas discussed above concrete by giving a mathematical formulation. It is
advantageous to cast a partially ordered plan as a labeled DAG rather than a partial ordering
since it draws the attention to the difference between actions and steps.
A non-empty set p ⊆ P of facts that are effects of the action of S and preconditions of
that of T . We call such a labeled edge a causal link and write it S −p→ T .
An edge labeled with ≺ is called a temporal constraint and written as S ≺ T .
An open condition is a precondition of a step not yet causally linked.
Definition 19.5.4. Let Π be a partially ordered plan, then we call a step U possibly
intervening in a causal link S −p→ T , iff Π ∪ {S ≺ U , U ≺ T } is acyclic.
Definition 19.5.5. A precondition is achieved iff it is the effect of an earlier step
and no possibly intervening step undoes it.
Definition 19.5.6. A partially ordered plan Π is called complete iff every precon-
dition is achieved.
Definition 19.5.7. Partial order planning is the process of computing complete
and acyclic partially ordered plans for a given planning task.
Actions: Buy(x)
  Preconditions: At(p), Sells(p, x)
  Effects: Have(x)
Notation: A causal link S −p→ T can also be denoted by a direct arrow between the
effects p of S and the preconditions p of T in the STRIPS action notation above.
Temporal constraints are shown as dashed arrows.
Planning Process
Definition 19.5.9. Partial order planning is search in the space of partial plans via
the following operations:
add link from an existing action to an open precondition,
add step (an action with links to other steps) to fulfil an open condition,
order one step wrt. another to remove possible conflicts.
Idea: Gradually move from incomplete/vague plans to complete, correct plans.
Backtrack if an open condition is unachievable or if a conflict is unresolvable.
Definition 19.5.11. If C clobbers S −p→ T in a partially ordered plan Π, then we
can solve the induced conflict by
demotion: add a temporal constraint C ≺ S to Π, or
promotion: add T ≺ C to Π.
[figure: Go(Home) clobbers the causal link Go(SM) −At(SM)→ Buy(Milk); demotion =b put Go(Home) before Go(SM), promotion =b put it after Buy(Milk)]
Properties of POP
Nondeterministic algorithm: backtracks at choice points on failure:
choice of Sadd to achieve Sneed ,
choice of demotion or promotion for clobberer,
selection of Sneed is irrevocable.
Observation 19.5.15. POP is sound, complete, and systematic, i.e. there is no repetition of partial plans.
There are extensions for disjunction, universals, negation, and conditionals.
It can be made efficient with good heuristics derived from problem description.
Particularly good for problems with many loosely related subgoals.
[worked example (figure sequence): partial order planning on the blocks world problem above – the plan is initialized with the Start and Finish steps; refining for the subgoals On(A, B), On(B, C), Cl(B), and Cl(C) adds the steps Move(A, B), Move(B, C), and Move(C, T) with causal links; clobbering conflicts are resolved by promotion/demotion, and the resulting partial order plan linearizes to Move(C, T), Move(B, C), Move(A, B)]
19.7 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26900.
Summary
General problem solving attempts to develop solvers that perform well across a large
class of problems.
Suggested Reading:
• Chapters 10: Classical Planning and 11: Planning and Acting in the Real World in [RN09].
– Although the book is named “A Modern Approach”, the planning section was written long
before the IPC was even dreamt of, before PDDL was conceived, and several years before
heuristic search hit the scene. As such, what we have right now is the attempt of two outsiders
trying in vain to catch up with the dramatic changes in planning since 1995.
– Chapter 10 is Ok as a background read. Some issues are, imho, misrepresented, and it’s far
from being an up-to-date account. But it’s Ok to get some additional intuitions in words
different from my own.
– Chapter 11 is useful in our context here because we don’t cover any of it. If you’re interested
in extended/alternative planning paradigms, do read it.
• A good source for modern information (some of which we covered in the lecture) is Jörg
Hoffmann’s Everything You Always Wanted to Know About Planning (But Were Afraid to
Ask) [Hof11] which is available online at http://fai.cs.uni-saarland.de/hoffmann/papers/
ki11.pdf
Chapter 20
Planning II: Algorithms
20.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26901.
In planning, this is referred to as forward search, or forward state-space search.
[Figure: several paths from init to goal, each annotated with a cost estimate h.]
Heuristic function h estimates the cost of an optimal path from a state s to the
goal; search prefers to expand states s with small h(s).
Live Demo vs. Breadth-First Search:
http://qiao.github.io/PathFinding.js/visual/
Exactly like our definition from chapter 8, except that, because we assume unit costs here, we use N instead of R+ .
Definition 20.1.2. Let Π be a STRIPS task with states S. The perfect heuristic
h∗ assigns every s∈S the length of a shortest path from s to a goal state, or ∞ if
no such path exists. A heuristic function h for Π is admissible if, for all s∈S, we
have h(s)≤h∗ (s).
Exactly like our definition from chapter 8, except for path length instead of path
cost (cf. above).
In all cases, we attempt to approximate h∗ (s), the length of an optimal plan for s.
Some algorithms guarantee to lower bound h∗ (s).
The delete relaxation is the most successful method for the automatic generation
of heuristic functions. It is a key ingredient to almost all IPC winners of the last
decade. It relaxes STRIPS tasks by ignoring the delete lists.
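As a minimal illustration of the relaxation itself, here is a Python sketch. The representation (an action as a triple of precondition, add, and delete sets, facts as strings) is an assumption made for this sketch, not the notation used elsewhere in the notes.

def delete_relax(action):
    """Delete relaxation: keep preconditions and add list, drop the delete list."""
    pre, add, dele = action
    return (pre, add, set())

def apply_relaxed(state, action):
    """Applying a relaxed action only ever adds facts, so states grow monotonically."""
    pre, add, _ = action
    if not pre <= state:
        raise ValueError("preconditions not satisfied")
    return state | add

# Example: dr(A,B) from the route-finding example (illustrative encoding).
drAB = ({"truck(A)"}, {"truck(B)"}, {"truck(A)"})
print(apply_relaxed({"truck(A)", "pack(C)"}, delete_relax(drAB)))
# -> {'truck(A)', 'truck(B)', 'pack(C)'}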
The h+ Heuristic: What is the resulting heuristic function?
Relaxation in Route-Finding
We will start with a very simple relaxation, which could be termed “positive thinking”: we do not
consider preconditions of actions and leave out the delete lists as well.
[Diagram: the relaxation mapping R maps the problem Πs to R(Πs ) in P ′ , on which h∗ is computed.]
Real problem:
Initial state I: AC; goal G: AD.
Actions A: pre, add, del.
drXY, loX, ulX.
Relaxed problem:
State s: AC; goal G: AD.
Actions A: add.
hR (s) =1: ⟨ulD⟩.
Real problem:
State s: BC; goal G: AD.
Actions A: pre, add, del.
AC −drAB→ BC.
Relaxed problem:
State s: BC; goal G: AD.
Actions A: add.
hR (s) =2: ⟨drBA, ulD⟩.
Real problem:
State s: CC; goal G: AD.
Actions A: pre, add, del.
BC −drBC→ CC.
Relaxed problem:
State s: CC; goal G: AD.
Actions A: add.
hR (s) =2: ⟨drBA, ulD⟩.
Real problem:
State s: AC; goal G: AD.
Actions A: pre, add, del.
BC −drBA→ AC.
Real problem:
State s: AC; goal G: AD.
Actions A: pre, add, del.
Duplicate state, prune.
Real problem:
State s: DC; goal G: AD.
Actions A: pre, add, del.
CC −drCD→ DC.
Relaxed problem:
State s: DC; goal G: AD.
Actions A: add.
hR (s) =2: ⟨drBA, ulD⟩.
Real problem:
State s: CT ; goal G: AD.
Actions A: pre, add, del.
CC −loC→ CT .
Relaxed problem:
State s: CT ; goal G: AD.
Actions A: add.
hR (s) =2: ⟨drBA, ulD⟩.
Real problem:
State s: BC; goal G: AD.
Actions A: pre, add, del.
CC −drCB→ BC.
[Figure: the search space of the route-finding example under greedy best-first search with hR ; the states AC, BC, CC, CT, DT, . . . are annotated with their hR values (1 and 2), and "We are here" marks the current search node.]
[Diagram: how heuristic functions h : P → N ∪ {∞} are obtained: a task in P is mapped by the relaxation mapping R into the class P ′ ⊆ P of simpler tasks, where h∗ can be computed.]
“When the world changes, its previous state remains true as well.”
[Figure: real world vs. relaxed world, before and after an action: in the relaxed world, the previous state remains true as well.]
In other words, the class of simpler problems P ′ is the set of all STRIPS tasks with
empty delete lists, and the relaxation mapping R drops the delete lists.
Definition 20.3.2 (Relaxed Plan). Let Π:=⟨P , A, I , G⟩ be a STRIPS task, and
let s be a state. A relaxed plan for s is a plan for ⟨P , A, s, G⟩+ . A relaxed plan for
I is called a relaxed plan for Π.
A relaxed plan for s is an action sequence that solves s when pretending that all
delete lists are empty.
Also called “delete-relaxed plan”: “relaxation” is often used to mean “delete-relaxation”
by default.
load(x)+ : “truck(x), pack(x) ⇒ pack(T )”.
unload(x)+ : “truck(x), pack(T ) ⇒ pack(x)”.
Relaxed plan:
⟨drive(A, B)+ , drive(B, C)+ , load(C)+ , drive(C, D)+ , unload(D)+ ⟩
We don’t need to drive the truck back, because “it is still at A”.
PlanEx+
Definition 20.3.3 (Relaxed Plan Existence Problem). By PlanEx+ , we denote
the problem of deciding, given a STRIPS task Π:=⟨P , A, I , G⟩, whether or not there
exists a relaxed plan for Π.
Iterations on F :
1. {at(Sy), vis(Sy)}
2. ∪ {at(Ad), vis(Ad), at(Br), vis(Br)}
3. ∪ {at(Da), vis(Da), at(Pe), vis(Pe)}
Iterations on F :
1. {truck(A), pack(C)}
2. ∪{truck(B)}
3. ∪{truck(C)}
4. ∪{truck(D), pack(T )}
Iterations on F :
1. {truck(A), pack(C)}
2. ∪{truck(B)}
3. ∪{truck(C)}
4. ∪{pack(T )}
5. ∪{pack(A), pack(B)}
6. ∪∅
5.2. Assume, for the moment, that we drop line (*) from the algorithm. It is then easy to see that ai ∈Ai and apply(I, ⟨a0 + , . . ., ai−1 + ⟩) ⊆ Fi , for all i.
5.3. We get G ⊆ apply(I, ⟨a0 + , . . ., an−1 + ⟩) ⊆ Fn , and the algorithm returns “solvable” as desired.
5.4. Assume to the contrary of the claim that, in an iteration i < n, (*) fires. Then G ̸⊆ F and F = F ′ . But, with F = F ′ , F = Fj for all j > i, and we get G ̸⊆ Fn in contradiction.
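The iterations on F above come from a fixpoint computation. A minimal Python sketch of it follows (using the same set-based action encoding as in the earlier sketch; this is not verbatim the algorithm with line (*) referred to in the proof, only an illustration of the idea).

def relaxed_plan_exists(facts_init, actions, goal):
    """Decide PlanEx+: grow F by applying all relaxed actions whose preconditions
    are already in F, until the goal is contained in F or F stabilizes."""
    F = set(facts_init)
    while True:
        if goal <= F:
            return True
        new_F = set(F)
        for pre, add, _ in actions:          # delete lists are ignored
            if pre <= F:
                new_F |= add
        if new_F == F:                       # no progress and goal not reached: line (*) fires
            return False
        F = new_F

# Logistics illustration matching the iterations listed above (hypothetical encoding):
actions = [
    ({"truck(A)"}, {"truck(B)"}, {"truck(A)"}),           # drive(A,B)
    ({"truck(B)"}, {"truck(C)"}, {"truck(B)"}),           # drive(B,C)
    ({"truck(C)"}, {"truck(D)"}, {"truck(C)"}),           # drive(C,D)
    ({"truck(C)", "pack(C)"}, {"pack(T)"}, {"pack(C)"}),  # load(C)
    ({"truck(D)", "pack(T)"}, {"pack(D)"}, {"pack(T)"}),  # unload(D)
]
print(relaxed_plan_exists({"truck(A)", "pack(C)"}, actions, {"pack(D)"}))  # True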
h+ is Admissible
Lemma 20.4.3. Let Π:=⟨P , A, I , G⟩ be a STRIPS task, and let s be a state. If ⟨a1 , . . ., an ⟩ is a plan for Πs :=⟨P , A, {s}, G⟩, then ⟨a1 + , . . ., an + ⟩ is a plan for Πs + .
If we ignore deletes, the states along the plan can only get bigger.
Theorem 20.4.4. h+ is Admissible.
Proof:
1. Let Π:=⟨P , A, I , G⟩ be a STRIPS task with state set S, and let s∈S.
2. h+ (s) is defined as the optimal plan length in Πs + .
3. With the lemma above, any plan for Πs also constitutes a plan for Πs + .
4. Thus the optimal plan length in Πs + can only be shorter than that in Πs , and the claim follows.
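Computing h+ exactly means finding an optimal relaxed plan, which is NP-hard in general, so it is only feasible for tiny tasks. A brute-force sketch via breadth-first search over fact sets, using the same illustrative encoding as above:

from collections import deque

def h_plus(state, actions, goal):
    """Length of an optimal relaxed plan from `state`, or float('inf') if none exists.
    BFS over sets of facts; relaxed action application only adds facts."""
    start = frozenset(state)
    if goal <= start:
        return 0
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        facts, depth = frontier.popleft()
        for pre, add, _ in actions:          # ignore delete lists
            if pre <= facts:
                nxt = frozenset(facts | add)
                if goal <= nxt:
                    return depth + 1
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return float("inf")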
[Slide animation: greedy best-first search on the route-finding example, this time with h+ : for each generated state the relaxed problem (actions with preconditions and add lists, but no delete lists) is solved; e.g. BC −drBA→ AC regenerates the initial state AC.]
[Figure: the search space of the route-finding example annotated with h+ values; along the solution path AC, BC, CC, CT, DT, DD, CD, BD, AD the values decrease 5, 5, 5, 4, 4, 3, 2, 1, 0, and "We are here" marks the current search node.]
h+ in the Blocksworld
[Figure: initial and goal configurations of a blocksworld instance with blocks A, B, C, D.]
Optimal plan: ⟨putdown(A), unstack(B, D), stack(B, C), pickup(A), stack(A, B)⟩.
Optimal relaxed plan: ⟨stack(A, B), unstack(B, D), stack(B, C)⟩.
Observation: What can we say about the “search space surface” at the initial
state here?
The initial state lies on a local minimum under h+ , together with the successor state s where we stacked A onto B. All other direct neighbours of these two states have a strictly higher h+ value.
20.5 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26906.
Summary
Heuristic search on classical search problems relies on a function h mapping states
s to an estimate h(s) of their goal distance. Such functions h are derived by solving
relaxed problems.
In planning, the relaxed problems are generated and solved automatically. There are four known families of suitable relaxation methods: abstractions, landmarks, delete relaxation, and critical paths.
Suggested Reading:
• Chapters 10: Classical Planning and 11: Planning and Acting in the Real World in [RN09].
– Although the book is named “A Modern Approach”, the planning section was written long
before the IPC was even dreamt of, before PDDL was conceived, and several years before
heuristic search hit the scene. As such, what we have right now is the attempt of two outsiders
trying in vain to catch up with the dramatic changes in planning since 1995.
– Chapter 10 is Ok as a background read. Some issues are, imho, misrepresented, and it’s far
from being an up-to-date account. But it’s Ok to get some additional intuitions in words
different from my own.
– Chapter 11 is useful in our context here because we don’t cover any of it. If you’re interested
in extended/alternative planning paradigms, do read it.
• A good source for modern information (some of which we covered in the lecture) is Jörg
Hoffmann’s Everything You Always Wanted to Know About Planning (But Were Afraid to
Ask) [Hof11] which is available online at http://fai.cs.uni-saarland.de/hoffmann/papers/
ki11.pdf
Chapter 21
Searching, Planning, and Acting in the Real World
Outline
So Far: we made idealizing/simplifying assumptions:
The environment is fully observable and deterministic.
21.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26908.
Definition 21.1.4. The qualification problem in planning is that we can never finish
listing all the required preconditions and possible conditional effects of actions.
Root Cause: The environment is partially observable and/or non-deterministic.
Technical Problem: We cannot know the “current state of the world”, but search/planning algorithms are based on this assumption.
Actions:
remove lid from can
paint object with paint from open can.
We formalize the example in PDDL for simplicity. Note that the :percept scheme is not part of
the official PDDL, but fits in well with the design.
...)
The PDDL problem file has a “free” variable ?c for the (undetermined) joint
color.
(define (problem tc−coloring)
(:domain furniture−objects)
(:objects table chair c1 c2)
(:init (object table) (object chair) (can c1) (can c2) (inview table))
(:goal (color chair ?c) (color table ?c)))
Two action schemata: remove can lid to open and paint with open can
(:action remove−lid
:parameters (?x)
:precondition (can ?x)
:effect (open ?x))
(:action paint
:parameters (?x ?y)
:precondition (and (object ?x) (can ?y) (color ?y ?c) (open ?y))
:effect (color ?x ?c))
The paint action schema has a universal variable ?c ⇝ we cannot just give paint a color argument in a partially observable environment.
Sensorless Plan: Open one can, paint chair and table in its color.
Note: Contingent planning can create better plans, but needs perception
Two percept schemata: color of an object and color in a can
(:percept color
:parameters (?x ?c)
:precondition (and (object ?x) (inview ?x)))
(:percept can−color
:parameters (?x ?c)
:precondition (and (can ?x) (inview ?x) (open ?x)))
To perceive the color of an object, it must be in view; to perceive the color in a can, the can must additionally be open.
Note: In a fully observable world, the percepts would not have preconditions.
An action schema: looking at an object causes it to come into view.
(:action lookat
:parameters (?x)
:precondition (and (inview ?y) (notequal ?x ?y))
:effect (and (inview ?x) (not (inview ?y))))
Contingent Plan:
1. look at furniture to determine color, if same ; done.
2. else, open the cans (remove the lids) and look at the paint in them
3. if paint in one can is the same as an object, paint the other with this color
4. else paint both in any color
Conditional Plans
Definition 21.3.1. Conditional plans extend the possible actions in plans by conditional steps that execute sub-plans depending on whether K + P |= C, where K + P is the current knowledge base plus the percepts.
[Figure: search graph for the slippery vacuum world; the Suck and Right branches lead to GOAL and LOOP nodes ; OR graph with a loop.]
[L1 : left, if AtR then L1 else [if CleanL then ∅ else suck fi] fi] or
[while AtR do [left] done, if CleanL then ∅ else suck fi]
We have an infinite loop, but the plan eventually works unless the action always fails.
Problem: We do not know with certainty what state the world is in!
Idea: Just keep track of all the possible states it could be in.
Definition 21.4.1. A model based agent has a world model consisting of
a belief state that has information about the possible states the world may be
in, and
a sensor model that updates the belief state based on sensor information
a transition model that updates the belief state based on actions.
Idea: The agent environment determines what the world model can be.
In a fully observable, deterministic environment,
we can observe the initial state and subsequent states are given by the actions
alone.
thus the belief state is a singleton set (we call its member the world state) and
the transition model is a function from states and actions to states: a transition
function.
That is exactly what we have been doing until now: we have been studying methods that
build on descriptions of the “actual” world, and have been concentrating on the progression from
atomic to factored and ultimately structured representations. Tellingly, we spoke of “world states”
instead of “belief states”; we have now justified this practice in the brave new belief-based world
models by the (re-) definition of “world states” above. To fortify our intuitions, let us recap from
a belief-state-model perspective.
Let us now see what happens when we lift the restrictions of total observability and determin-
ism.
Note: This even applies to online problem solving, where we can just perceive
the state. (e.g. when we want to optimize utility)
In a deterministic, but partially observable environment,
the belief state must deal with a set of possible states.
we can use transition functions.
We need a sensor model, which predicts the influence of percepts on the belief
state – during update.
In a stochastic, partially observable environment,
mix the ideas from the last two. (sensor model + transition relation)
Decision-Theoretic Agents:
In a partially observable, stochastic environment
belief state + transition model ≙ decision networks,
inference ≙ maximizing expected utility.
Conformant/Sensorless Planning
Definition 21.5.1. Conformant or sensorless planning tries to find plans that work
without any sensing. (not even the initial state)
Example 21.5.2 (Sensorless Vacuum Cleaner World).
Observation 21.5.3. In a sensorless world we do not know the initial state. (or
any state after)
Observation 21.5.4. Sensorless planning must search in the space of belief states
(sets of possible actual states).
Example 21.5.5 (Searching the Belief State Space).
Start in {1, 2, 3, 4, 5, 6, 7, 8}.
Solution: [right, suck, left, suck]:
right → {2, 4, 6, 8}
suck → {4, 8}
left → {3, 7}
suck → {7}
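A minimal Python sketch of this belief-state computation. The encoding of the eight physical states and their numbering are chosen for this illustration so that the belief sets come out as in the example; they are not an official numbering from the notes.

# States are (location, dirt_left, dirt_right).
STATES = {
    1: ("L", True,  True),  2: ("R", True,  True),
    3: ("L", True,  False), 4: ("R", True,  False),
    5: ("L", False, True),  6: ("R", False, True),
    7: ("L", False, False), 8: ("R", False, False),
}
NUM = {v: k for k, v in STATES.items()}

def step(state, action):
    """Deterministic physical transition of the vacuum world."""
    loc, dl, dr = state
    if action == "right": return ("R", dl, dr)
    if action == "left":  return ("L", dl, dr)
    if action == "suck":  return (loc, False, dr) if loc == "L" else (loc, dl, False)
    raise ValueError(action)

def predict(belief, action):
    """Sensorless belief-state transition: apply the action to every possible state."""
    return {NUM[step(STATES[s], action)] for s in belief}

belief = set(STATES)                      # start in {1,...,8}: no information at all
for a in ["right", "suck", "left", "suck"]:
    belief = predict(belief, a)
    print(a, "->", sorted(belief))
# right -> [2, 4, 6, 8]; suck -> [4, 8]; left -> [3, 7]; suck -> [7]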
Let us see if we can understand the options for T^b(a, S) a bit better. The first question is when we want an action a to be applicable to a belief state S ⊆ S (the set of all states), i.e. when T^b(a, S) should be non-empty. In the first case, a^b would be applicable iff a is applicable to some s∈S, in the second case if a is applicable to all s∈S. So we only want to choose the first case if actions are harmless.
The second question we ask ourselves is what the result of applying a to S should be. Again, if actions are harmless, we can just collect the results; otherwise, we need to make sure that all members of the result a^b are reached for all possible states in S.
[Figure 3.3 (RN): The state space for the vacuum world. Links denote actions: L = Left, R = Right, S = Suck.]
Problem: Belief states are HUGE; e.g. the initial belief state for the 10 × 10 vacuum world contains 100 · 2^100 ≈ 10^32 physical states.
The update stage determines, for each possible percept, the resulting belief state:
UPDATE(b̂, o) := {s | o = PERC(s) and s ∈ b̂}
The functions PRED and PERC are the main parameters of this model. We define
RESULT(b, a) := {UPDATE(PRED(b, a), o) | o ∈ PossPERC(PRED(b, a))}
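A direct transcription of these functions into Python, as a sketch: the physical transition model `transition(s, a)` (returning a set of possible successors) and the percept function `percept(s)` are assumed to be given and are not defined in the notes.

def PRED(belief, action, transition):
    """Prediction stage: all states reachable by doing `action` in some state of `belief`."""
    return {s2 for s in belief for s2 in transition(s, action)}

def PossPERC(belief, percept):
    """Percepts that are possible in some state of `belief`."""
    return {percept(s) for s in belief}

def UPDATE(b_hat, o, percept):
    """Update stage: keep the states of the predicted belief state consistent with percept o."""
    return {s for s in b_hat if percept(s) == o}

def RESULT(belief, action, transition, percept):
    """All belief states that can result from doing `action` and then observing a percept."""
    b_hat = PRED(belief, action, transition)
    return [UPDATE(b_hat, o, percept) for o in PossPERC(b_hat, percept)]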
[Figure 4.14 (RN): Two examples of transitions in local-sensing vacuum worlds. (a) In the deterministic world, Right is applied in the initial belief state, resulting in a new predicted belief state with two possible physical states; for those states, the possible percepts are [R, Dirty] and [R, Clean], leading to two belief states, each of which is a singleton. (b) In the slippery world, Right is applied in the initial belief state, giving a new belief state with four physical states; for those states, the possible percepts are [L, Dirty], [R, Dirty], and [R, Clean], leading to three belief states as shown.]
(a) The action Right is deterministic, sensing disambiguates to singletons.
(b) Slippery World: The action Right is non-deterministic, sensing disambiguates somewhat.
Belief-State Search with Percepts
Observation: The belief-state transition model induces an AND–OR graph.
Idea: Use AND–OR search in nondeterministic environments.
[Figure 4.15 (RN): The first level of the AND–OR search tree for a problem in the local-sensing vacuum world; Suck is the first action in the solution.]
Solution: [Suck, Right, if Bstate = {6} then Suck else [] fi]
Contingent Planning
Definition 21.6.7. The generation of plans with conditional branching based on percepts is called contingent planning; its solutions are called contingent plans.
Appropriate for partially observable or non-deterministic environments.
Example 21.6.8. Continuing Example 21.2.1.
One of the possible contingent plans is
((lookat table) (lookat chair)
(if (and (color table c) (color chair c)) (noop)
((removelid c1) (lookat c1) (removelid c2) (lookat c2)
(if (and (color table c) (color can c)) ((paint chair can))
(if (and (color chair c) (color can c)) ((paint table can))
((paint chair c1) (paint table c1)))))))
Note: Variables in this plan are existential; e.g. in
line 2: If there is some joint color c of the table and chair ; done.
line 4/5: Condition can be satisfied by [c1 /can] or [c2 /can] ; instantiate ac-
cordingly.
Definition 21.6.9. During plan execution the agent maintains the belief state b,
chooses the branch depending on whether b |= c for the condition c.
Note: The planner must make sure b |= c can always be decided.
Here:
Given an action a and percepts p = p1 ∧ . . . ∧ pn , we have
Definition 21.7.2. The competitive ratio of an online problem solving agent is the quotient of
its online performance, i.e. the actual cost induced by online problem solving, and
its offline performance, i.e. the cost of an optimal solution with full information.
[Figure 4.18 (RN): A simple maze problem on a 3 × 3 grid. The agent starts at S and must reach G but knows nothing of the environment (i.e. Down in (1,1) results in (1,1) back).]
Observation 21.7.5. No online algorithm can avoid dead ends in all state spaces.
Example 21.7.6. Two state spaces that lead an online search agent into dead ends:
[Figure 4.19 (a) (RN): Two state spaces that might lead an online search agent into a dead end. Any given agent will fail in at least one of these spaces.]
Any given agent will fail in at least one of the spaces.
Definition 21.7.7. We call Example 21.7.6 an adversary argument.
Example 21.7.8. Forcing an online agent into an arbitrarily inefficient route:
[Figure 4.19 (b) (RN): A two-dimensional environment that can cause an online search agent to follow an arbitrarily inefficient route to the goal. Whichever choice the agent makes, the adversary blocks that route with another long, thin wall, so that the path followed is much longer than the best possible path.]
Whichever choice the agent makes, the adversary can block it with a long, thin wall.
Observation: Dead ends are a real problem for robots: ramps, stairs, cliffs, . . .
Definition 21.7.9. A state space is called safely explorable, iff a goal state is reachable from every reachable state.
Therefore, e.g. A∗ can expand any node in the fringe, but an online agent must go
there to explore it.
Intuition: It seems best to expand nodes in “local order” to avoid spurious travel.
Idea: Depth first search seems a good fit. (must only travel for backtracking)
Replanning for Plan Repair
Generally: Replanning when the agent's model of the world is incorrect.
Example 21.8.4 (Plan Repair by Replanning). Given a plan from S to G.
Figure 11.12 At first, the sequence “whole plan” is expected to get the agent from S to G.
The agent executes steps of the plan until it expects to be in state E, but observes that it is
actually in O. The agent then replans for the minimal repair plus continuation to reach G.
The agent executes wholeplan step by step, monitoring the rest of the plan.
After a few steps the agent expects to be in E, but observes state O.
Replanning: by calling the planner recursively
find state P in wholeplan and a plan repair from O to P . (P may be G)
minimize the cost of repair + continuation
Definition 21.8.5. There are three levels of execution monitoring: before executing
an action
action monitoring checks whether all preconditions still hold.
plan monitoring checks that the remaining plan will still succeed.
goal monitoring checks whether there is a better set of goals it could try to
achieve.
Note: Example 21.8.4 was a case of action monitoring leading to replanning.
Idea: On failure, resume planning (e.g. by POP) to achieve open conditions from
current state.
Definition 21.8.6. IPEM (Integrated Planning, Execution, and Monitoring):
Chapter 22
Semester Change-Over
Planning Frameworks
Planning Algorithms
Planning and Acting in the real world
[Figure 2.1 (RN): Agents interact with environments through sensors and actuators.]
Simple Reflex Agents
[Figure 2.9 (RN): Schematic diagram of a simple reflex agent: the sensors determine "what the world is like now", condition-action rules determine "what action I should do now", and the actuators execute it.]

function SIMPLE-REFLEX-AGENT(percept) returns an action
  persistent: rules, a set of condition–action rules
  state ← INTERPRET-INPUT(percept)
  rule ← RULE-MATCH(state, rules)
  action ← rule.ACTION
  return action

[Figure 2.10 (RN): A simple reflex agent. It acts according to a rule whose condition matches the current state, as defined by the percept.]

Reflex Agents with State
[Figure 2.12 (RN): A model-based reflex agent. It keeps track of the current state of the world using an internal model ("how the world evolves", "what my actions do"). It then chooses an action in the same way as the reflex agent.]

  state ← UPDATE-STATE(state, action, percept, model)
  rule ← RULE-MATCH(state, rules)
  action ← rule.ACTION
  return action

[Figure: an agent schema additionally containing "Goals" and "What it will be like if I do action A" (goal-based agent).]
[Figure 2.15 (RN): A general learning agent, with components: performance standard, critic (providing feedback), learning element (making changes, setting learning goals), performance element (knowledge), problem generator, sensors, and actuators.]
Rational Agent

Idea: Try to design agents that are successful. (do the right thing)

Definition 22.1.1. An agent is called rational, if it chooses whichever action maximizes the expected value of the performance measure given the percept sequence to date. This is called the MEU principle.

Note: A rational agent need not be perfect:
it only needs to maximize expected value (rational ̸= omniscient)
  it need not predict e.g. very unlikely but catastrophic events in the future
percepts may not supply all relevant information (rational ̸= clairvoyant)
  if we cannot perceive things, we do not need to react to them,
  but we may need to try to find out about hidden dangers (exploration)
action outcomes may not be as expected (rational ̸= successful)
  but we may need to take action to ensure that they do (more often) (learning)

Rational ; exploration, learning, autonomy
22.2 Administrativa
We will now go through the ground rules for the course. This is a kind of a social contract
between the instructor and the students. Both have to keep their side of the deal to make learning
as efficient and painless as possible.
Now we come to a topic that is always interesting to the students: the grading scheme.
Assessment, Grades
Academic Assessment: 90 minutes exam directly after courses end (∼ July 25
2023)
Retake Exam: 90 min exam directly after courses end the following semester (∼ Feb. 13 2024)
Module Grade:
Grade via the exam (Klausur) ; 100% of the grade
Results from “Übungen zu Künstliche Intelligenz” give up to 10% bonus to an
exam with ≥ 50% points. (not passed ; no bonus)
I do not think that this is the best possible scheme, but I have very little choice.
I basically do not have a choice in the grading scheme, as it is essentially the only one consistent with
university/state policies. For instance, I would like to give you more incentives for the homework
assignments – which would also mitigate the risk of having a bad day in the exam. Also, graded
quizzes would help you prepare for the lectures and thus let you get more out of them, but that
is also impossible.
Double Jeopardy : Homeworks only give 10% bonus points for the
exam, but without trying you are unlikely to pass the exam.
Admin: To keep things running smoothly
Homeworks will be posted on StudOn.
Sign up for AI-2 under https://www.studon.fau.de/crs4419186.html.
Homeworks are handed in electronically there. (plain text, program files, PDF)
Go to the tutorials, discuss with your TA! (they are there for you!)
Homework Discipline:
Start early! (many assignments need more than one evening’s work)
Don’t start by sitting at a blank screen (talking & study group help)
Humans will be trying to understand the text/code/math when grading it.
It is very well-established experience that without doing the homework assignments (or something
similar) on your own, you will not master the concepts, you will not even be able to ask sensible
questions, and take nothing home from the course. Just sitting in the course and nodding is not
enough! If you have questions please make sure you discuss them with the instructor, the teaching
assistants, or your fellow students. There are three sensible venues for such discussions: online in
the lecture, in the tutorials, which we discuss now, or in the course forum – see below. Finally, it
is always a very good idea to form study groups with your friends.
Approach: Weekly tutorials and homework assignments (first one in week two)
Goal 1: Reinforce what was taught in class. (you need practice)
Goal 2: Allow you to ask any question you have in a protected environment.
Instructor/Lead TA:
Florian Rabe (KWARC Postdoc)
Room: 11.137 @ Händler building, florian.rabe@fau.de
Also: Group submission has not worked well in the past! (too many freeloaders)
Do use the opportunity to discuss the AI-2 topics with others. After all, one of the non-trivial
skills you want to learn in the course is how to talk about Artificial Intelligence topics. And that
takes practice, practice, and practice. But what if you are not in a lecture or tutorial and want
to find out more about the AI-2 topics?
FAU has issued a very insightful guide on using lecture recordings. It is a good idea to heed these
recommendations, even if they seem annoying at first.
Using lecture recordings:
Attend lectures.
Catch up.
Due to the current AI hype, the course Artificial Intelligence is very popular and thus many
degree programs at FAU have adopted it for their curricula. Sometimes the course setup that fits the CS program does not fit the others very well; therefore there are some special conditions I want to state here.
I can only warn of what I am aware, so if your degree program lets you jump through extra hoops,
please tell me and then I can mention them here.
We restart the new semester by reminding ourselves of (the problems, methods, and issues of) Artificial Intelligence, and what has been achieved so far.
The first question we have to ask ourselves is “What is Artificial Intelligence?”, i.e. how can we
define it. And already that poses a problem since the natural definition “like human intelligence, but artificially realized” presupposes a definition of Intelligence, which is equally problematic; even
Psychologists and Philosophers – the subjects nominally “in charge” of human intelligence – have
problems defining it, as witnessed by the plethora of theories e.g. found at [WHI].
Maybe we can get around the problems of defining “what Artificial intelligence is”, by just de-
scribing the necessary components of AI (and how they interact). Let’s have a try to see whether
that is more informative.
The components of Artificial Intelligence are quite daunting, and none of them are fully un-
derstood, much less achieved artificially. But for some tasks we can get by with much less. And
indeed that is what the field of Artificial Intelligence does in practice – but keeps the lofty ideal
around. This practice of “trying to achieve AI in selected and restricted domains” (cf. the discus-
sion starting with slide 27) has borne rich fruits: systems that meet or exceed human capabilities
in such areas. Such systems are in common use in many domains of application.
in outer space: systems need autonomous control; remote control is impossible due to time lag.
in artificial limbs: the user controls the prosthesis via existing nerves, and can e.g. grip a sheet of paper.
in household appliances: the iRobot Roomba vacuums, mops, and sweeps in corners, . . . , parks, charges, and discharges; general robotic household help is on the horizon.
in hospitals: in the USA 90% of the prostate operations are carried out by RoboDoc; Paro is a cuddly robot that eases solitude in nursing homes.
The AI Conundrum
Observation: Reserving the term “Artificial Intelligence” has been quite a land
grab!
But: researchers at the Dartmouth Conference (1956) really thought they would solve/reach AI in two/three decades.
Consequence: AI still asks the big questions.
Another Consequence: AI as a field is an incubator for many innovative tech-
nologies.
Still Consequence: AI research was alternately flooded with money and cut off brutally.
There are currently three main avenues of attack to the problem of building artificially intelligent
systems. The (historically) first is based on the symbolic representation of knowledge about the
world and uses inference-based methods to derive new knowledge on which to base action decisions.
The second uses statistical methods to deal with uncertainty about the world state and learning
methods to derive new (uncertain) world assumptions to act on.
As a consequence, the field of Artificial Intelligence (AI) is an engineering field at the intersec-
tion of computer science (logic, programming, applied statistics), cognitive science (psychology,
neuroscience), philosophy (can machines think, what does that mean?), linguistics (natural lan-
guage understanding), and mechatronics (robot hardware, sensors).
Subsymbolic AI and in particular machine learning is currently hyped to such an extent, that
many people take it to be synonymous with “Artificial Intelligence”. It is one of the goals of this
course to show students that this is a very impoverished view.
We combine the topics in this way in this course, not only because this reproduces the historical
development but also as the methods of statistical and subsymbolic AI share a common basis.
It is important to notice that all approaches to AI have their application domains and strong points.
We will now see that exactly the two areas where symbolic AI and statistical/subsymbolic AI have their respective fortes correspond to natural application areas.
[Figure: precision of AI techniques; producer tasks require close to 100% precision.]
General Rule: Subsymbolic AI is well suited for consumer tasks, while symbolic
AI is better suited for producer tasks.
An example of a producer task – indeed this is where the name comes from – is the case of a
machine tool manufacturer T , which produces digitally programmed machine tools worth multiple
million Euro and sells them into dozens of countries. Thus T must also produce comprehensive machine operation manuals, a non-trivial undertaking, since no two machines are identical and they must be translated into many languages, leading to hundreds of documents. As those manuals share a lot of semantic content, their management should be supported by AI techniques. It is critical that these methods maintain a high precision, as operation errors can easily lead to very costly machine damage and loss of production. On the other hand, the domain of these manuals is quite restricted: a machine tool has only a couple of hundred components, which can be described by only a couple of thousand attributes.
Indeed companies like T employ high-precision AI techniques like the ones we will cover in this
course successfully; they are just not so much in the public eye as the consumer tasks.
We always try to find a topic at the intersection of your and our interests.
We also often have positions! (HiWi, Ph.D.: 1/2, PostDoc: full)
ascribed to the agent behaving rationally, i.e. optimizing the expected utility of its actions given
the (current) environment.
In the last semester we restricted ourselves to fully observable, deterministic, episodic environ-
ments, where optimizing utility is easy in principle – but may still be computationally intractable,
since we have full information about the world
An agent is an entity that perceives its environment through sensors and acts upon
that environment through actuators.
A rational agent is an agent maximizing its expected performance measure.
In AI-1 we dealt mainly with a logical approach to agent design (no uncertainty).
We ignored
interface to environment (sensors, actuators)
uncertainty
the possibility of self-improvement (learning)
This semester we want to alleviate all these restrictions and study rationality in more realistic circumstances, i.e. environments which need only be partially observable and where our actions can be non-deterministic. Both of these extensions conspire to allow us only partial knowledge about the world, so that we can only optimize “expected utility” instead of “actual utility” of our actions.
This directly leads to the first topic.
The second topic is motivated by the fact that environments can change and are initially unknown, and therefore the agent must obtain and/or update parameters like utilities and world knowledge by observing the environment.
The last topic (which we will only attack if we have time) is motivated by multi agent environments, where multiple agents have to collaborate for problem solving. Note that even the adversarial search methods discussed in chapter 9 were essentially single agent, as both opponents optimized the utility of their actions alone.
In true multi agent environments we have to also optimize collaboration between agents, and that is usually radically more efficient if agents can communicate.
Part V
This part of the course notes addresses inference and agent decision making in partially observable
environments, i.e. where we only know probabilities instead of certainties whether propositions
are true/false. We cover basic probability theory and – based on that – Bayesian Networks and
simple decision making in such environments. Finally we extend this to probabilistic temporal
models and their decision theory.
Chapter 23
Quantifying Uncertainty
In this chapter we develop a machinery for dealing with uncertainty: Instead of thinking about
what we know to be true, we must think about what is likely to be true.
Non-deterministic actions:
“When I try to go forward in this dark cave, I might actually go forward-left or
forward-right.”
Unreliable Sensors
Robot Localization: Suppose we want to support localization using landmarks
to narrow down the area.
Example 23.1.1. If you see the Eiffel tower, then you’re in Paris.
Difficulty: Sensors can be imprecise.
Even if a landmark is perceived, we cannot conclude with certainty that the
robot is at that location.
This is the half-scale Las Vegas copy, you dummy.
Even if a landmark is not perceived, we cannot conclude with certainty that the
robot is not at that location.
Top of Eiffel tower hidden in the clouds.
[Figure 2.1 (RN): Agents interact with environments through sensors and actuators.]
Different agents differ on the contents of the white box in the center.
Rationality

Idea: Try to design agents that are successful! (aka. "do the right thing")

Definition 23.1.4. A performance measure is a function that evaluates a sequence of environments.

Example 23.1.5. A performance measure for the vacuum cleaner world could
award one point per square cleaned up in time T ?
award one point per clean square per time step, minus one per move?
penalize for > k dirty squares?

Definition 23.1.6. An agent is called rational, if it chooses whichever action maximizes the expected value of the performance measure given the percept sequence to date.

Question: Why is rationality a good quality to aim for?

Consequences of Rationality: Exploration, Learning, Autonomy

Note: A rational agent need not be perfect.
Autonomy avoids fixed behaviors that can become unsuccessful in a changing environment. (anything else would be irrational)
The agent has to learn all relevant traits, invariants, and properties of the environment and its actions.
Environment types
23.1.3 Agent Architectures based on Belief States
A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/29041.
We are now ready to proceed to environments which can only be partially observed and where our actions are non-deterministic. Both sources of uncertainty conspire to allow us only partial knowledge about the world, so that we can only optimize “expected utility” instead of “actual utility” of our actions.
A model-based agent has a world model consisting of
a belief state that has information about the possible states the world may be
in, and
a sensor model that updates the belief state based on sensor information
a transition model that updates the belief state based on actions.
Idea: The agent environment determines what the world model can be.
That is exactly what we have been doing until now: we have been studying methods that
build on descriptions of the “actual” world, and have been concentrating on the progression from
atomic to factored and ultimately structured representations. Tellingly, we spoke of “world states”
instead of “belief states”; we have now justified this practice in the brave new belief-based world
models by the (re-) definition of “world states” above. To fortify our intuitions, let us recap from
a belief-state-model perspective.
Let us now see what happens when we lift the restrictions of total observability and determin-
ism.
Decision-Theoretic Agents:
In a partially observable, stochastic environment
belief state + transition model ≙ decision networks,
inference ≙ maximizing expected utility.
(deterministic) world states. Let us evaluate whether this is enough for them to survive in the
world.
Rational Agents:
We have a choice of actions (go to FRA early, go to FRA just in time).
These can lead to different solutions with different probabilities.
The actions have different costs.
The results have different utilities (safe timing/dislike airport food).
460 CHAPTER 23. QUANTIFYING UNCERTAINTY
A rational agent chooses the action with the maximum expected utility.
Decision Theory = Utility Theory + Probability Theory.
Utility-based agents
Definition 23.1.24. A utility based agent uses a world model along with a utility
function that models its preferences among the states of that world. It chooses the
action that leads to the best expected utility.
Agent Schema:
[Figure 2.14 (RN): A model-based, utility-based agent. It uses a model of the world, along with a utility function that measures its preferences among states of the world. Then it chooses the action that leads to the best expected utility, where expected utility is computed by averaging over all possible outcome states, weighted by the probability of the outcome.]
Decision-Theoretic Agent
Example 23.1.25 (A particular kind of utility-based agent).

function DT-AGENT(percept) returns an action
  persistent: belief state, probabilistic beliefs about the current state of the world
              action, the agent's action
  update belief state based on action and percept
  calculate outcome probabilities for actions, given action descriptions and current belief state
  select action with highest expected utility given probabilities of outcomes and utility information
  return action

[Figure 13.1 (RN): A decision-theoretic agent that selects rational actions.]
Bayes’ Rule: The basic insight about how to invert the “direction” of conditional probabilities.
Conditional Independence: How to capture and exploit complex relations between random variables?
This explains the difficulties arising when using Bayes’ rule on multiple pieces of evidence; conditional independence is used to ameliorate these difficulties.
Probabilistic Models
Definition 23.2.1. A probability theory is an assertion language for talking about
possible worlds and an inference method for quantifying the degree of belief in such
assertions.
Remark: Like logic, but for non-binary degrees of belief.
The possible worlds are mutually exclusive: different possible worlds cannot both
be the case and exhaustive: one possible world must be the case.
This determines the set of possible worlds.
Example 23.2.2. If we roll two (distinguishable) dice with six sides, then we have
36 possible worlds: (1,1), (2,1), . . . , (6,6).
We will restrict ourselves to a discrete, countable sample space. (others more
complicated, less useful in AI)
Convenience Notations:
By convention, we denote Boolean random variables with A, B, and more gen-
eral finite domain random variables with X, Y .
For a Boolean random variable Name, we write name for the outcome Name = T
and ¬name for Name = F. (Follows Russell/Norvig as well)
Probability Distributions
Definition 23.2.10. The probability distribution for a random variable X, written
P(X), is the vector of probabilities for the (ordered) domain of X.
Example 23.2.11. Probability distributions for finite domain and Boolean random
variables
Headache = T Headache = F
Weather = sunny P (W = sunny ∧ headache) P (W = sunny ∧ ¬headache)
Weather = rain
Weather = cloudy
Weather = snow
Definition 23.2.16.
Given random variables {X 1 , . . ., X n }, the full joint probability distribution, denoted
P(X 1 , . . ., X n ), lists the probabilities of all atomic events.
Observation:
Given random variables X 1 , . . ., X n with domains D1 , . . ., Dn , the full joint proba-
bility distribution is an n-dimensional array of size ⟨D1 , . . . ,Dn ⟩.
Example 23.2.17. P(Cavity, T oothache)
toothache ¬toothache
cavity 0.12 0.08
¬cavity 0.08 0.72
Note: All atomic events are disjoint (their pairwise conjunctions all are equivalent
to F ); the sum of all fields is 1 (the disjunction over all atomic events is T ).
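For concreteness, here is a minimal Python encoding of this full joint distribution as a dictionary indexed by atomic events; the representation is purely illustrative and will be reused in the sketches below.

# Full joint distribution P(Cavity, Toothache) from Example 23.2.17.
P = {
    (True,  True):  0.12,  # cavity, toothache
    (True,  False): 0.08,  # cavity, ¬toothache
    (False, True):  0.08,  # ¬cavity, toothache
    (False, False): 0.72,  # ¬cavity, ¬toothache
}
assert abs(sum(P.values()) - 1.0) < 1e-9   # atomic events are disjoint and exhaustive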
The role of clause 2 in Definition 23.2.18 is for P to “make sense”: intuitively, the probability
weight of a formula should be the sum of the weights of the interpretations satisfying it. Imagine
this was not so; then, for example, we could have P (A) = 0.2 and P (A ∧ B) = 0.8.
How to derive from (i), (ii’), and (iii) that, for all propositions A, P (¬a) = 1−P (a)?
Answer: reserved for the plenary sessions ; be there!
Believing in Kolmogorov?
Reminder 1: (i) P (⊤) = 1; (ii’) P (a ∨ b) = P (a) + P (b) − P (a ∧ b).
Reminder 2: “Probabilities model our belief.”
If P represents an objectively observable probability, the axioms clearly make
sense.
But why should an agent respect these axioms, when modeling its subjective
own belief?
Question: Do you believe in Kolmogorov’s axioms?
Answer: reserved for the plenary sessions ; be there!
you are informed that your current train has 30 minutes delay.
Example 23.3.2. The “probability of cavity” increases when the doctor is informed
that the patient has a toothache.
P (a|b) := P (a ∧ b)/P (b)
Intuition: The likelihood of having a and b, within the set of outcomes where we
have b.
Example 23.3.6. P (cavity ∧ toothache) = 0.12 and P (toothache) = 0.2 yield
P (cavity|toothache) = 0.6.
Headache = T Headache = F
Weather = sunny P (W = sunny|headache) P (W = sunny|¬headache)
Weather = rain
Weather = cloudy
Weather = snow
23.4 Independence
A Video Nugget covering this section can be found at https://fau.tv/clip/id/29050.
toothache ¬toothache
cavity 0.12 0.08
¬cavity 0.08 0.72
Answer: No:
Given n random variables with k values each, the full joint probability distribution contains k^n probabilities.
Computational cost of dealing with this size.
Practically impossible to assess all these probabilities.
Question: So, is there a compact way to represent the full joint probability distri-
bution? Is there an efficient method to work with that representation?
Answer: Not in general, but it works in many cases. We can work directly with
conditional probabilities, and exploit conditional independence.
Independence (Examples)
Example 23.4.4.
toothache ¬toothache
cavity 0.12 0.08
¬cavity 0.08 0.72
Adding variable Weather with values sunny, rain, cloudy, snow, the full joint prob-
ability distribution contains 16 probabilities.
But your teeth do not influence the weather, nor vice versa!
Weather is independent of each of Cavity and Toothache: For all value combi-
nations (c,t) of Cavity and Toothache, and for all values w of Weather, we have
P (c ∧ t ∧ w) = P (c ∧ t) · P (w).
P(Cavity, Toothache, Weather) can be reconstructed from the separate tables
P(Cavity, Toothache) and P(Weather). (8 probabilities)
The component wise array product from Definition 23.5.3 is something that Russell/Norvig (and
the literature in general) gloss over and sweep under the rug. The problem is that it is not a
real mathematical operator that can be defined notation-independently, because it depends on
the indices in the representation. But the notation is just too convenient to bypass.
It is just a coincidence that for independent random variables we can write the joint distribution
as an outer product P(X, Y) = P(X) · P(Y): here, the outer product and the component wise
array product coincide.
Marginalization
Extracting a sub-distribution from a larger joint distribution:
Given sets X and Y of random variables, we have:

P(X) = ∑_{y∈Y} P(X, y)

where ∑_{y∈Y} sums over all possible value combinations of Y.
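As an illustration (not part of the original notes), the following Python sketch represents the full joint distribution of Example 23.2.17 as a numpy array and marginalizes out one variable by summing over the corresponding axis:

    import numpy as np

    # Full joint distribution P(Cavity, Toothache) from Example 23.2.17,
    # indexed as [cavity, toothache] with index 0 = T and 1 = F.
    joint = np.array([[0.12, 0.08],
                      [0.08, 0.72]])

    # Marginalization: P(Cavity) = sum_t P(Cavity, t), i.e. sum out axis 1.
    p_cavity = joint.sum(axis=1)       # -> [0.2, 0.8]
    p_toothache = joint.sum(axis=0)    # -> [0.2, 0.8]
    print(p_cavity, p_toothache)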
We now come to a very important technique of computing unknown probabilities, which looks
almost like magic. Before we formally define it on the next slide, we will get an intuition by
considering it in the context of our dentistry example.
Normalization: Idea
Problem: We know P (cavity ∧ toothache) but don’t know P (toothache).
Step 1: Case distinction over values of Cavity: (P (toothache) as an unknown)
To understand what is going on, consider the joint distribution P(Cavity, Toothache) from above,
arranged as a table with rows cavity/¬cavity and columns toothache/¬toothache.
Normalization
Question: Say we know P(likeschappi ∧ dog) = 0.32 and P(¬likeschappi ∧ dog) =
0.08. Can we compute P(likeschappi|dog)? (Chappi ≙ a popular dog food)
Answer: reserved for the plenary sessions ; be there!
Question: So what is P (likeschappi|dog)?
Normalization: Formal
Definition 23.5.8.
Given a vector ⟨w1, . . ., wk⟩ of numbers in [0,1] where ∑_{i=1}^{k} wi ≤ 1, the normalization
constant α is α⟨w1, . . ., wk⟩ := (1/∑_{i=1}^{k} wi) · ⟨w1, . . ., wk⟩.
Note:
The condition ∑_{i=1}^{k} wi ≤ 1 is needed because these will be relative weights, i.e. a
case distinction over a subset of all worlds (the one fixed by the knowledge in our
conditional probability).
Example 23.5.9. α⟨0.12, 0.08⟩ = 5⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩.
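A small sketch (illustrative, using the same array representation as before) of how normalization recovers the conditional distribution of Example 23.5.9:

    import numpy as np

    def normalize(w):
        """Scale a vector of relative weights so that it sums to 1."""
        w = np.asarray(w, dtype=float)
        return w / w.sum()

    # alpha<0.12, 0.08> = 5*<0.12, 0.08> = <0.6, 0.4>, i.e. P(Cavity|toothache)
    print(normalize([0.12, 0.08]))   # -> [0.6 0.4]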
Bayes’ Rule
Definition 23.6.1 (Bayes’ Rule). Given propositions A and B where P (a) ̸= 0
and P (b) ̸= 0, we have:
P(a|b) = (P(b|a) · P(a))/P(b)
This equation is called Bayes’ rule.
Proof:
1. By definition, P(a|b) = P(a ∧ b)/P(b).
2. By the product rule, P(a ∧ b) = P(b|a) · P(a); substituting this yields the claim.
Notation: This is a system of equations!
Doctor d′ knows P(m|s) from observation; she does not need Bayes' rule!
Indeed, but what if a meningitis epidemic erupts?
Then d knows that P(m|s) grows proportionally with P(m), while d′ is clueless.
Conditional Independence
So we have the following network:
[Bayesian network: Cavity is the parent of both Toothache and Catch]
2. Normalization+Marginalization:
P(X|e) = α · P(X, e); if Y ≠ ∅ then P(X|e) = α · (∑_{y∈Y} P(X, e, y))
3. Chain rule:
Order X 1 = Cavity, X 2 = Toothache, X 3 = Catch.
Thus:
P(Cavity|toothache, catch)
= α · P(catch|Cavity) · P(toothache|Cavity) · P(Cavity)
= α · ⟨0.9 · 0.6 · 0.2, 0.2 · 0.1 · 0.8⟩
= α · ⟨0.108, 0.016⟩
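As an illustrative sketch (not from the notes; the CPT numbers 0.2, 0.6/0.1, and 0.9/0.2 are read off the example computation above), the same query in Python:

    # P(Cavity), P(toothache|Cavity), P(catch|Cavity) for Cavity in (True, False).
    p_cavity    = {True: 0.2, False: 0.8}
    p_toothache = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
    p_catch     = {True: 0.9, False: 0.2}   # P(catch | Cavity)

    # Unnormalized P(Cavity | toothache, catch) via the chain rule, then normalize.
    unnorm = {c: p_catch[c] * p_toothache[c] * p_cavity[c] for c in (True, False)}
    alpha = 1.0 / sum(unnorm.values())
    posterior = {c: alpha * v for c, v in unnorm.items()}
    print(posterior)   # {True: ~0.871, False: ~0.129}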
Observation 23.7.6.
In a naive Bayes model, the distribution of the cause given the observed effects can be
computed from the full joint probability distribution as

P(cause|effect1, . . ., effectn) = α · P(cause) · ∏ᵢ P(effectᵢ|cause)

Note: This kind of model is called "naive" since it is often used as a simplifying
model even if the effects are not conditionally independent after all.
It is also called idiot Bayes model by Bayesian fundamentalists.
In practice, naive Bayes models can work surprisingly well, even when the conditional
independence assumption is not true.
Example 23.7.7. The dentistry example is a (true) naive Bayes model.
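A minimal naive Bayes sketch (illustrative, not from the notes), reusing the dentistry numbers with cause = Cavity and effects = Toothache, Catch:

    def naive_bayes_posterior(prior, cond, observed):
        """prior: dict cause -> P(cause); cond: dict effect -> dict cause -> P(effect=T|cause);
        observed: dict effect -> bool. Returns the normalized posterior over causes."""
        unnorm = {}
        for c, p_c in prior.items():
            likelihood = 1.0
            for e, value in observed.items():
                p = cond[e][c]
                likelihood *= p if value else (1.0 - p)
            unnorm[c] = p_c * likelihood
        z = sum(unnorm.values())
        return {c: v / z for c, v in unnorm.items()}

    prior = {True: 0.2, False: 0.8}
    cond = {"toothache": {True: 0.6, False: 0.1}, "catch": {True: 0.9, False: 0.2}}
    print(naive_bayes_posterior(prior, cond, {"toothache": True, "catch": True}))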
Questionnaire
Consider the random variables X 1 = Animal, X 2 = LikesChappi, and X 3 =
LoudNoise, and X 1 has values {dog, cat, other}, X 2 and X 3 are Boolean.
Question: Which statements are correct?
(A) Animal is independent of LikesChappi.
(B) LoudNoise is independent of LikesChappi.
(C) Animal is conditionally independent of LikesChappi given LoudNoise.
(D) LikesChappi is conditionally independent of LoudNoise given Animal.
Think about this intuitively: given the two values of variable X, are the chances of Y
being true different for one of them (fixing the value of the third variable where specified)?
at Example 23.1.17 to understand whether logic was up to the job of guiding an agent in the
Wumpus cave.
Idea: Let's see whether our probabilistic reasoning machinery can help!
We split the set of hidden variables into fringe and other variables: U = F ∪ O
where F is the fringe and O the rest.
Corollary 23.8.3. P (b|P 1,3 , κ, U ) = P (b|P 1,3 , κ, F ) (by conditional
independence)
Now: let us exploit this formula.
Wumpus: Reasoning
We calculate:

P(P1,3|κ, b) = α (∑_{u∈U} P(P1,3, u, κ, b))
             = α (∑_{u∈U} P(b|P1,3, κ, u) · P(P1,3, κ, u))
             = α (∑_{f∈F} ∑_{o∈O} P(b|P1,3, κ, f, o) · P(P1,3, κ, f, o))
             = α (∑_{f∈F} P(b|P1,3, κ, f) · (∑_{o∈O} P(P1,3, κ, f, o)))
             = α (∑_{f∈F} P(b|P1,3, κ, f) · (∑_{o∈O} P(P1,3) · P(κ) · P(f) · P(o)))
             = α P(P1,3) P(κ) (∑_{f∈F} P(b|P1,3, κ, f) · P(f) · (∑_{o∈O} P(o)))
             = α′ P(P1,3) (∑_{f∈F} P(b|P1,3, κ, f) · P(f))
Wumpus: Solution
We calculate using the product rule and conditional independence (see above):

P(P1,3|κ, b) = α′ · P(P1,3) · (∑_{f∈F} P(b|P1,3, κ, f) · P(f))

Let us explore the possible models (values) f of the fringe F that are compatible with observation b.

P(P1,3|κ, b) = α′ · ⟨0.2 · (0.04 + 0.16 + 0.16), 0.8 · (0.04 + 0.16)⟩ ≈ ⟨0.31, 0.69⟩
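A small numeric check of this result (illustrative sketch, not from the notes): with pit probability 0.2 per square, the fringe-model weights 0.04, 0.16, 0.16 and 0.04, 0.16 used above arise as products of pit priors P(f).

    p = 0.2  # prior probability of a pit in any square

    # Fringe models compatible with the breeze, as products of pit priors.
    weights_pit    = [p * p, p * (1 - p), (1 - p) * p]   # 0.04, 0.16, 0.16
    weights_no_pit = [p * p, p * (1 - p)]                # 0.04, 0.16 (the two models used above)

    unnorm = [p * sum(weights_pit), (1 - p) * sum(weights_no_pit)]
    z = sum(unnorm)
    print([round(x / z, 2) for x in unnorm])   # -> [0.31, 0.69]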
23.9 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/29056.
Summary
Uncertainty is unavoidable in many environments, namely whenever agents do not
have perfect knowledge.
Probabilities express the degree of belief of an agent, given its knowledge, in an event.
Conditional probabilities express the likelihood of an event given observed evidence.
Assessing a probability ≙ using statistics to approximate the likelihood of an event.
Bayes’ rule allows us to derive, from probabilities that are easy to assess, probabil-
ities that aren’t easy to assess.
Given multiple evidence, we can exploit conditional independence.
24.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/29218.
[Bayesian network: Cavity is the parent of both Toothache and Catch]
2. Normalization+Marginalization:
P(X|e) = α · P(X, e) = α · ∑_{y∈Y} P(X, e, y)
Some Applications
A ubiquitous problem: Observe “symptoms”, need to infer “causes”.
Examples: medical diagnosis, face recognition.
Question: Given that both John and Mary call me, what is the probability of a
burglary?
Burglary: P(B) = .001        Earthquake: P(E) = .002

Alarm:   B  E  P(A)
         T  T  .95
         T  F  .94
         F  T  .29
         F  F  .001

JohnCalls:  A  P(J)        MaryCalls:  A  P(M)
            T  .90                     T  .70
            F  .05                     F  .01
Note:
In each P(X i |Parents(X i )), we show only P(X i = T|Parents(X i )). We don’t show
P(X i = F|Parents(X i )) which is 1 − P(X i = T|Parents(X i )).
[the burglary network with the CPTs shown above]
[network structure: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]
[generic network fragment: node X with parents U1, . . ., Um, children Y1, . . ., Yn, and the children's other parents Z1j, . . ., Znj]
[the burglary network structure again]
[CPTs for Alarm, JohnCalls, and MaryCalls as above]
Chain Rule:
For any ordering X1, . . ., Xn, we have:

P(X1, . . ., Xn) = ∏_{i=1}^{n} P(Xi|Xi−1, . . ., X1)

With Definition 24.3.3 (A), we can use P(Xi|Parents(Xi)) instead of P(Xi|Xi−1, . . ., X1):

P(X1, . . ., Xn) = ∏_{i=1}^{n} P(Xi|Parents(Xi))
Note:
If there is a cycle, then no ordering X1, . . ., Xn is consistent with the BN; so in the
chain rule on X1, . . ., Xn there comes a point where we have P(Xi|Xi−1, . . ., X1) in the chain but
P(Xi|Parents(Xi)) in the definition of the distribution, with Parents(Xi) ⊄ {Xi−1, . . ., X1} – and then
the products differ. So the chain rule can no longer be used to prove that we can reconstruct
the full joint probability distribution. In fact, cyclic Bayesian networks contain ambiguities (several
interpretations possible) and may be self-contradictory (no probability distribution matches the
Bayesian network).
[the burglary network with the CPTs shown above]
[network B: Animal is the parent of both LoudNoise and LikesChappi]
Say B is the Bayesian network above. Which statements are correct?
Note: For ?? we try to determine whether – given different value assignments to potential parents
– the probability of Xi being true differs? If yes, we include these parents. In the particular case:
Again: Given different value assignments to potential parents, does the probability of Xi being
true differ? If yes, include these parents.
1. M to J as before.
2. M, J to E as probability of E is higher if M/J is true.
3. Same for B; E to B because, given M and J are true, if E is true as well then prob of B is
lower than if E is false.
4. M /J/B/E to A because if M /J/B/E is true (even when changing the value of just one of
these) then probability of A is higher.
Note: size(B) ≙ the total number of entries in the CPTs.
In the worst case, size(B) = n · ∏_{i=1}^{n} #(Di), namely if every variable depends on
all its predecessors in the chosen order.
Intuition: BNs are compact if each variable is directly influenced only by few of
its predecessor variables.
If we model Fever as a noisy disjunction node, then the general rule
P(Xi|Parents(Xi)) = ∏_{j|Xj=T} qj for the CPT gives the following table:
[the burglary network with the CPTs shown above]
What is P(Burglary|johncalls)?
What is P(Burglary|johncalls, marycalls)?
2. Normalization+Marginalization:
P(X|e) = α · P(X, e); if Y ≠ ∅ then P(X|e) = α · (∑_{y∈Y} P(X, e, y))
3. Chain Rule:
Order X 1 , . . ., X n consistent with B.
[CPTs for Alarm, JohnCalls, and MaryCalls as above]
Order: X 1 = B, X 2 = E, X 3 = A, X 4 = J, X 5 = M .
Note: This step is actually done implicitly by the pseudo-code: in the recursive calls to
enumerate-all we multiply our own probability with all the rest. That is valid because,
the variable ordering being consistent, all our parents occur earlier in the order – which is
just another way of saying "my own probability does not depend on the variables in the rest
of the order". The probabilities of the outer variables multiply the entire "rest of the sum".
Chain rule and conditional independence, ctd.:

P(B|j, m) = α · P(B) · (∑_{vE} P(vE) · (∑_{vA} P(vA|B, vE) · P(j|vA) · P(m|vA)))

For B = b this expands to

P(b|j, m) = α · P(b) · ( P(e) · (P(a|b, e) · P(j|a) · P(m|a) + P(¬a|b, e) · P(j|¬a) · P(m|¬a))
                       + P(¬e) · (P(a|b, ¬e) · P(j|a) · P(m|a) + P(¬a|b, ¬e) · P(j|¬a) · P(m|¬a)))

P(B|j, m) = α · ⟨0.00059224, 0.0014919⟩ ≈ ⟨0.284, 0.716⟩
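The same computation as a runnable sketch (illustrative; this is not the enumeration pseudo-code of [RN03], but a direct enumeration over the hidden variables with the CPT values of the network above):

    import itertools

    # CPTs of the burglary network (probabilities of the variable being True).
    P_B = 0.001
    P_E = 0.002
    P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
    P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = T | Alarm)
    P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = T | Alarm)

    def p(prob_true, value):
        return prob_true if value else 1.0 - prob_true

    def joint(b, e, a, j, m):
        return p(P_B, b) * p(P_E, e) * p(P_A[(b, e)], a) * p(P_J[a], j) * p(P_M[a], m)

    # Enumerate the hidden variables E and A for the query P(Burglary | j, m).
    unnorm = {b: sum(joint(b, e, a, True, True)
                     for e, a in itertools.product([True, False], repeat=2))
              for b in (True, False)}
    z = sum(unnorm.values())
    print({b: v / z for b, v in unnorm.items()})   # approx {True: 0.284, False: 0.716}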
Inference by enumeration = a tree with “sum nodes” branching over values of hidden
variables, and with non-branching “multiplication nodes”.
2. Then the computation is performed in terms of factor product and summing out
variables from factors:

P(B|j, m) = α · f1(B) · (∑_{vE} f2(E) · (∑_{vA} f3(A, B, E) · f4(A) · f5(A)))
So?: Life goes on . . . In the hard cases, if need be we can throw exactitude to
the winds and approximate.
Example 24.5.8. Sampling techniques as in MCTS.
24.6 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/29228.
Summary
Bayesian networks (BN) are a wide-spread tool to model uncertainty and to reason about it.
Reading:
• Chapter 14: Probabilistic Reasoning of [RN03].
– Section 14.1 roughly corresponds to my “What is a Bayesian Network?”.
– Section 14.2 roughly corresponds to my "What is the Meaning of a Bayesian Network?" and
"Constructing Bayesian Networks". The main change I made here is to define the semantics
of the BN in terms of the conditional independence relations, which I find clearer than RN's
definition that uses the reconstructed full joint probability distribution instead.
– Section 14.4 roughly corresponds to my “Inference in Bayesian Networks”. RN give full details
on variable elimination, which makes for nice ongoing reading.
– Section 14.3 discusses how CPTs are specified in practice.
25.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/30338.
Decision Theory
Definition 25.1.1. Decision theory investigates decision problems, i.e. how an
agent a deals with choosing among actions based on the desirability of their out-
comes given by a real-valued utility function u on states s∈S: i.e. u : S→R.
fully observable, iff A's sensors give it access to the complete state of the
environment at any point in time, else partially observable.
deterministic, iff the next state of the environment is completely determined by
the current state and A's action, else stochastic.
episodic, iff A's experience is divided into atomic episodes, where it perceives
and then performs a single action. Crucially, the next episode does not depend
on previous ones. Non-episodic environments are called sequential.
For now: We restrict ourselves to episodic decision theory, which deals with
choosing among actions based on the desirability of their immediate outcomes. (no
need to treat time explicitly)
Later:
We will study sequential decision problems, where the agent’s utility depends on a
sequence of decisions. (chapter 27)
Utility-based agents
Definition 25.1.2. A utility based agent uses a world model along with a utility
function that models its preferences among the states of that world. It chooses the
action that leads to the best expected utility.
Agent Schema:
[figure (AIMA Fig. 2.14): A model-based, utility-based agent. It uses a model of the world, along with
a utility function that measures its preferences among states of the world. Then it chooses the
action that leads to the best expected utility, where expected utility is computed by averaging
over all possible outcome states, weighted by the probability of the outcome.]
Decision networks
Value of information
Example 25.2.1. I have to decide whether to go to class today (or sleep in). What
is the utility of this lecture? (obviously 42)
Idea: We can let people/agents choose between two states! (subjective
preference)
Example 25.2.2. Give me your cell-phone or I will give you a bloody nose. ⇝
To make a decision in a deterministic environment, the agent must determine
whether it prefers a state without a phone to one with a bloody nose.
Definition 25.2.3.
Given states A and B (we call them prizes), an agent can express preferences of
the form A≻B (A preferred to B), A∼B (indifference between A and B), or B≻A.
Definition 25.2.5.
Given prizes Ai and probabilities pi with ∑_{i=1}^{n} pi = 1, a lottery [p1,A1; . . .; pn,An] represents
the result of a nondeterministic action that can have outcomes Ai with prior probability pi.
For the binary case, we use [p,A; 1−p,B].
[diagram: a lottery L with outcome A with probability p and outcome B with probability 1−p]
Rational Preferences
Idea: Preferences of a rational agent must obey constraints:
Rational preferences ⇝ behavior describable as maximization of expected utility.
Definition 25.2.6. We call a set ≻ of preferences rational, iff the following con-
straints hold:
Orderability A≻B ∨ B≻A ∨ A∼B
Transitivity A≻B ∧ B≻C ⇒ A≻C
Continuity A≻B≻C ⇒ (∃p [p,A;1−p,C]∼B)
Substitutability A∼B ⇒ [p,A;1−p,C]∼[p,B;1−p,C]
Monotonicity A≻B ⇒ (p>q) ⇔ [p,A;1−p,B]≻[q,A;1−q,B]
Decomposability [p,A;1−p,[q,B;1−q,C]]∼[p,A ; ((1 − p)q),B ; ((1 − p)(1 − q)),C]
Orderability: A≻B ∨ B≻A ∨ A∼B Given any two prizes or lotteries, a rational agent must either prefer one
to the other or else rate the two as equally preferable. That is, the agent cannot avoid deciding.
Refusing to bet is like refusing to allow time to pass.
Continuity: A≻B≻C ⇒ (∃p [p,A;1−p,C]∼B) If some lottery B is between A and C in preference, then there
is some probability p for which the rational agent will be indifferent between getting B for sure
and the lottery that yields A with probability p and C with probability 1 − p.
Substitutability: A∼B ⇒ [p,A;1−p,C]∼[p,B;1−p,C] If an agent is indifferent between two lotteries A and B, then
the agent is indifferent between two more complex lotteries that are the same except that B
is substituted for A in one of them. This holds regardless of the probabilities and the other
outcome(s) in the lotteries.
Monotonicity: A≻B ⇒ (p>q) ⇔ [p,A;1−p,B]≻[q,A;1−q,B] Suppose two lotteries have the same two possible
outcomes, A and B. If an agent prefers A to B, then the agent must prefer the lottery that has
a higher probability for A (and vice versa).
[diagram: the compound lottery [p,A; 1−p,[q,B; 1−q,C]] and the equivalent flat lottery [p,A; (1−p)q,B; (1−p)(1−q),C]]
Observation: With deterministic prizes only (no lottery choices), only a total
ordering on prizes can be determined.
Definition 25.3.2. We call a total ordering on states a value function or ordinal
utility function.
Idea: Apply this idea to get the expected utility of an action – which is stochastic:
in partially observable environments, we do not know the current state;
in nondeterministic environments, we cannot be sure of the result of an action.
Definition 25.3.4. Let A be an agent with a set Ω of states and a utility function
U : Ω→R+ 0 , then for each action a, we define a random variable Ra whose values
are the results of performing a in the current state.
Definition 25.3.5. The expected utility EU(a|e) of an action a (given evidence e) is

EU(a|e) := ∑_{s∈Ω} P(Ra = s|a, e) · U(s)
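A minimal sketch of MEU action selection under these definitions (illustrative; the outcome distributions and utilities below are made up):

    def expected_utility(outcome_dist, utility):
        """outcome_dist: dict state -> P(R_a = state | a, e); utility: dict state -> U(state)."""
        return sum(p * utility[s] for s, p in outcome_dist.items())

    def meu_action(actions, utility):
        """actions: dict action -> outcome distribution. Returns the action maximizing EU."""
        return max(actions, key=lambda a: expected_utility(actions[a], utility))

    # Hypothetical example: two actions over states s1, s2.
    utility = {"s1": 10.0, "s2": 2.0}
    actions = {"a1": {"s1": 0.3, "s2": 0.7}, "a2": {"s1": 0.6, "s2": 0.4}}
    print(meu_action(actions, utility))   # -> "a2"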
Utilities
Intuition: Utilities map states to real numbers.
Question: Which numbers exactly?
Measuring Utility
Definition 25.3.8. Normalized utilities: u⊤ = 1, u⊥ = 0.
Ask them directly: What would you pay to avoid playing Russian roulette with
a million-barrelled revolver? (very large numbers)
But their behavior suggests a lower price:
Driving in a car for 370 km incurs a risk of one micromort;
over the life of your car – say, 150,000 km – that's 400 micromorts.
People appear to be willing to pay about 10,000€ more for a safer car that
halves the risk of death. (⇝ 25€ per micromort)
This figure has been confirmed across many individuals and risk types.
Of course, this argument holds only for small risks. Most people won’t agree to
kill themselves for 25M€.
Definition 25.3.10. QALYs: quality adjusted life years
Application: QALYs are useful for medical decisions involving substantial risk.
Typical empirical data, extrapolated with risk-prone behavior for debtors:
Strict Dominance
Typically define attributes such that U is monotone in each argument. (wlog.
growing)
Definition 25.4.3. Choice B strictly dominates choice A iff Xi (B) ≥ Xi (A) for
all i (and hence U (B)≥U (A))
Stochastic Dominance
Definition 25.4.4.
A distribution p2 stochastically dominates distribution p1 iff the cumulative distribution of p1
is everywhere at most that of p2, i.e. for all t:

∫_{−∞}^{t} p1(x)dx ≤ ∫_{−∞}^{t} p2(x)dx

Example 25.4.5. [figure: two distributions and their cumulative distributions illustrating stochastic dominance]
Example 25.4.7.
Construction cost increases with distance from the city; S1 is closer to the city than S2
⇝ S1 stochastically dominates S2 on cost.
Example 25.4.8. Injury increases with collision speed.
Idea: Annotate Bayesian networks with stochastic dominance information.
Definition 25.4.9.
+
X →Y (X positively influences Y ) means that P(Y |X 1 , z) stochastically dominates
P(Y |X 2 , z) for every value z of Y ’s other parents Z and all X 1 and X 2 with
X 1 ≥X 2 .
U(X1, . . ., Xn) = F(f1(X1), . . ., fn(Xn))
U = k 1 U 1 +k 2 U 2 +k 3 U 3 +k 1 k 2 U 1 U 2 +k 2 k 3 U 2 U 3 +k 3 k 1 U 3 U 1 +k 1 k 2 k 3 U 1 U 2 U 3
System Support: Routine procedures and software packages for generating pref-
erence tests to identify various canonical families of utility functions.
[figure: the model-based, utility-based agent schema from above (AIMA Fig. 2.14)]
As we already use Bayesian networks for the world/belief model, integrating utilities and possible
actions into the network suggests itself naturally. This leads to the notion of a decision
network.
Decision networks
Definition 25.5.1. A decision network is a Bayesian network with added action
nodes and utility nodes (also called value nodes) that enable decision making.
Algorithm:
For each value of action node
compute expected value of utility node given action, evidence
Return MEU action (via argmax)
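A sketch of this algorithm (illustrative; the evaluation of the utility node is abstracted into a function that you would implement with the BN inference machinery above):

    def best_action(action_values, evidence, expected_utility):
        """action_values: possible values of the action node;
        expected_utility(a, evidence): expected value of the utility node given action a.
        Returns the MEU action via argmax."""
        return max(action_values, key=lambda a: expected_utility(a, evidence))

    # Hypothetical usage with two candidate actions and made-up expected utilities.
    def eu(action, evidence):
        return {"site1": 42.0, "site2": 17.0}[action]

    print(best_action(["site1", "site2"], {}, eu))   # -> "site1"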
So far we have tacitly been concentrating on actions that directly affect the environment. We
will now come to a type of action we have hypothesized in the beginning of the course, but have
completely ignored up to now: information acquisition actions.
So, we should pay up to k/n€ for the information. (as much as block 3 is
worth)
Idea: So we must compute the expected gain over all possible values f∈D.
Definition 25.6.5. Let F be a random variable with domain D, then the value of
perfect information (VPI) on F given evidence E is defined as

VPI_E(F) := (∑_{f∈D} P(F = f|E) · EU(α_f|E, F = f)) − EU(α|E)
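A direct transcription of this definition (illustrative sketch; α and α_f denote the MEU actions before and after observing F = f, computed with a function like best_action above; the numbers are made up):

    def vpi(values_of_F, p_f_given_e, eu_after, eu_current_best):
        """values_of_F: domain D of F; p_f_given_e[f]: P(F=f|E);
        eu_after(f): EU of the best action after additionally observing F=f;
        eu_current_best: EU of the best action under evidence E alone."""
        expected_after = sum(p_f_given_e[f] * eu_after(f) for f in values_of_F)
        return expected_after - eu_current_best

    # Hypothetical numbers: observing F is worth 0.7*10 + 0.3*4 - 6 = 2.2 utility units.
    print(vpi(["hi", "lo"], {"hi": 0.7, "lo": 0.3}, lambda f: {"hi": 10, "lo": 4}[f], 6.0))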
Properties of VPI
Note:
When more than one piece of evidence can be gathered, maximizing VPI for each
piece to select one is not always optimal
⇝ evidence-gathering becomes a sequential decision problem.
We will now use information value theory to specialize our utility-based agent from above.
Particle filtering?
Further algorithms and Topics?
Definition 26.1.4 (Basic Setup). A temporal probability model has two sets of
random variables indexed by N.
Xt ≙ set of (unobservable) state variables at time t ≥ 0,
e.g., BloodSugart, StomachContentst, etc.
Et ≙ set of (observable) evidence variables at time t > 0,
e.g., MeasuredBloodSugart, PulseRatet, FoodEatent
Notation: Xa:b = Xa , Xa+1 , . . . , Xb−1 , Xb
Markov Processes
Idea: Construct a Bayesian network from these variables. (parents?)
Definition 26.1.8. We say that a Markov process has the nth order Markov
property for n∈N⁺, iff P(Xt|X0:t−1) = P(Xt|Xt−n:t−1).
Special Cases:
First-order Markov property: P(Xt|X0:t−1) = P(Xt|Xt−1)
Intuition: Increasing the order adds “memory” to the process, Markov chains have
none.
Preview: We will use Markov processes to model sequential environments.
Possible fixes:
1. Increase the order of the Markov process. (more dependencies)
2. Add state variables, e.g., add Tempt , Pressuret . (more information sources)
We will see the second in another example: tracking robot motion.
[figure: network fragment over variables Vt, Xt, Zt replicated for t−1, t, t+1]
Idea: We can restore the Markov property by including a state variable for the
charge level Bt . (Better still: Battery level sensor)
[figure: the fragment extended with variables Mt and Bt in addition to Vt, Xt, Zt]
Problem: Even with Markov property the transition model is infinite. (t∈N)
Definition 26.1.14. A Markov chain is called stationary if the transition model is
independent of time, i.e. P(Xt |Xt−1 ) is the same for all t.
Example 26.1.15 (Umbrellas are stationary). P(Rt |Rt−1 ) does not depend on
t. (need only one table)
Inference tasks
Definition 26.2.1. The Markov inference tasks consist of filtering, prediction,
smoothing, and most likely explanation, as defined below.
Definition 26.2.2. Filtering (or monitoring): P(Xt |e1:t )
computing the belief state input to the decision process of a rational agent.
Definition 26.2.3. Prediction (or state estimation): P(Xt+k|e1:t) for k > 0:
evaluation of possible action sequences. (≙ filtering without the evidence)
Definition 26.2.4. Smoothing (or hindsight): P(Xk |e1:t ) for 0 ≤ k < t
better estimate of past states. (essential for learning)
Definition 26.2.5. Most likely explanation: argmax_{x1:t} P(x1:t|e1:t)
speech recognition, decoding with a noisy channel.
Note: P(et+1 |Xt+1 ) can be obtained directly from the sensor model.
Continue by conditioning on the current state Xt :
P(Xt+1|e1:t+1)
  = α · P(et+1|Xt+1) · (∑_{xt} P(Xt+1|xt, e1:t) · P(xt|e1:t))
  = α · P(et+1|Xt+1) · (∑_{xt} P(Xt+1|xt) · P(xt|e1:t))
P(Xt+1 |Xt ) is simply the transition model, P (xt |e1:t ) the “recursive call”.
So f 1:t+1 = α · FORWARD(f 1:t , et+1 ) where f 1:t = P(Xt |e1:t ) and FORWARD is
the update shown above. (Time and space constant (independent of t))
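A compact sketch of this FORWARD update for the umbrella model (illustrative; the transition value 0.7 and the sensor values 0.9/0.2 match the numbers used in Example 26.2.9 below):

    import numpy as np

    T = np.array([[0.7, 0.3],     # row: previous state (rain, ¬rain)
                  [0.3, 0.7]])    # column: next state
    O_umbrella = np.diag([0.9, 0.2])    # sensor model for evidence "umbrella"

    def forward(f, O):
        """One filtering step: f_{1:t+1} = alpha * O_{t+1} * T^T * f_{1:t}."""
        f_new = O @ T.T @ f
        return f_new / f_new.sum()

    f = np.array([0.5, 0.5])            # prior P(R_0)
    f = forward(f, O_umbrella)          # day 1, umbrella observed
    print(f)                            # -> approx [0.818, 0.182]
    f = forward(f, O_umbrella)          # day 2, umbrella observed
    print(f)                            # -> approx [0.883, 0.117]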
Proof sketch: Using the same reasoning as for the FORWARD algorithm for filter-
ing.
Observation 26.2.8. As k → ∞, P(xt+k|e1:t) tends to the stationary distribution
of the Markov chain, i.e. a fixed point under prediction.
Intuition: The mixing time, i.e. the time until prediction reaches the stationary
distribution depends on how “stochastic” the chain is.
Smoothing
Smoothing estimates past states by computing P(Xk |e1:t ) for 0 ≤ k < t
Divide evidence e1:t into e1:k (before k) and ek+1:t (after k):
Smoothing (continued)
Backward message bk+1:t = P(ek+1:t|Xk) computed by a backwards recursion:

P(ek+1:t|Xk) = ∑_{xk+1} P(ek+1:t|Xk, xk+1) · P(xk+1|Xk)
             = ∑_{xk+1} P(ek+1:t|xk+1) · P(xk+1|Xk)
             = ∑_{xk+1} P(ek+1, ek+2:t|xk+1) · P(xk+1|Xk)
             = ∑_{xk+1} P(ek+1|xk+1) · P(ek+2:t|xk+1) · P(xk+1|Xk)
P (ek+1 |xk+1 ) and P(xk+1 |Xk ) can be directly obtained from the model, P (ek+2:t |xk+1 )
is the “recursive call” (bk+2:t ).
In message notation: bk+1:t = BACKWARD(bk+2:t , ek+1:t ) where BACKWARD
is the update shown above. (time and space constant (independent of t))
Smoothing example
Example 26.2.9 (Smoothing Umbrellas). Umbrella appears on days 1/2.
P(R1|u1, u2) = α · P(R1|u1) · P(u2|R1) = α · ⟨0.818, 0.182⟩ · P(u2|R1)
Compute P(u2|R1) by backwards recursion:

P(u2|R1) = ∑_{r2} P(u2|r2) · 1 · P(r2|R1)
         = 0.9 · 1 · ⟨0.7, 0.3⟩ + 0.2 · 1 · ⟨0.3, 0.7⟩ = ⟨0.69, 0.41⟩

(the factor 1 is the backward message for the empty evidence sequence after day 2)
Time complexity linear in t (polytree inference), Space complexity O(t · #(f )).
Most likely path to each xt+1 = most likely path to some xt plus one more step.
I.e., m1:t(i) gives the probability of the most likely path to state i.
The update has the sum replaced by max, giving the Viterbi algorithm:

m1:t+1 = P(et+1|Xt+1) · max_{xt} (P(Xt+1|xt) · m1:t)

Observation 26.2.13. Viterbi has linear time complexity (like filtering), but also linear
space complexity (it needs to keep a pointer to the most likely sequence leading to each
state).
Viterbi example
Example 26.2.14 (Viterbi for Umbrellas). View the possible state sequences for
Raint as paths through state graph.
To find the "most likely sequence", follow the bold arrows back from the most likely state in m1:5.
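A sketch of the Viterbi update for the umbrella model (illustrative; same transition and sensor numbers as in the filtering sketch above):

    import numpy as np

    T = np.array([[0.7, 0.3], [0.3, 0.7]])      # P(X_{t+1} | X_t)
    O = {True: np.array([0.9, 0.2]),             # P(umbrella | X)
         False: np.array([0.1, 0.8])}            # P(¬umbrella | X)

    def viterbi(evidence, prior=np.array([0.5, 0.5])):
        """Return the most likely state sequence (True = rain) for the given evidence list."""
        m = O[evidence[0]] * (T.T @ prior)        # m_{1:1}
        backptrs = []
        for e in evidence[1:]:
            trans = T * m[:, None]                # trans[i, j] = P(x_j | x_i) * m(i)
            backptrs.append(trans.argmax(axis=0))
            m = O[e] * trans.max(axis=0)
        states = [int(m.argmax())]                # follow pointers back from the best final state
        for bp in reversed(backptrs):
            states.append(int(bp[states[-1]]))
        return [s == 0 for s in reversed(states)]  # index 0 = rain

    print(viterbi([True, True, False, True, True]))  # -> [True, True, False, True, True]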
HMM Algorithm
Idea: The forward and backward messages are column vectors in HMMs.
For instance, the probability that the sensor on a square with obstacles in north
and south would produce NSE is (1−ε)³ · ε¹.
Idea: Use the HMM filtering equation f1:t+1 = α · Ot+1 T^⊤ f1:t for localization. (next)
Idea: Use the HMM filtering equation f1:t+1 = α · Ot+1 T^⊤ f1:t to compute the posterior
distribution over locations. (i.e. robot localization)
[figure: localization error and Viterbi path accuracy as a function of the number of observations, for per-bit sensor error rates ε = 0.20, 0.10, 0.05, 0.02, 0.00]
[textbook excerpt on HMM robot localization (AIMA Ch. 15) omitted]

Country dance algorithm

Idea: We can avoid storing all forward messages in smoothing by running the forward algorithm backwards:

f1:t+1 = α · Ot+1 T^⊤ f1:t
Ot+1⁻¹ f1:t+1 = α · T^⊤ f1:t
α′ · (T^⊤)⁻¹ Ot+1⁻¹ f1:t+1 = f1:t

Algorithm: Forward pass computes f1:t, backward pass does f1:i, bt−i:t.
Observation: The backward pass only stores one copy of f1:i, bt:t−i ⇝ constant space.
Problem: The algorithm is severely limited: the transition matrix must be invertible and
the sensor matrix cannot have zeroes – that is, every observation must be possible in
every state.
Definition 26.4.5 (Naive method). Unroll the network and run any exact algo-
rithm.
Summary
Temporal probability models use state and evidence variables replicated over time.
Markov property and stationarity assumption, so we need both
a transition model P(Xt|Xt−1) and
a sensor model P(Et|Xt).
Tasks are filtering, prediction, smoothing, most likely sequence; (all done
recursively with constant cost per time step)
Hidden Markov models have a single discrete state variable; (used for speech
recognition)
DBNs subsume HMMs, exact update intractable.
Particle filtering is a good approximate filtering algorithm for DBNs.
A Video Nugget covering the introduction to this chapter can be found at https://fau.tv/
clip/id/30356. We will now pick up the thread from chapter 25, but using temporal models
instead of simply probabilistic ones. We will first look at sequential decision theory in the special
case where the environment is stochastic but fully observable (Markov decision processes), and
then lift that to obtain POMDPs and present an agent design based on that.
Outline
Markov decision processes (MDPs) for sequential environments.
[diagram relating search (explicit actions and subgoals) to MDPs (uncertainty and utility)]
Solving MDPs
Recall: In search problems, the aim is to find an optimal sequence of actions.
In MDPs, the aim is to find an optimal policy π(s), i.e., the best action for every
possible state s. (because we can't predict where we will end up)
Note: When you run against a wall, you stay in your square.
[figure: optimal policies for the 4×3 world for different reward ranges:
R(s) < −1.6284;  −0.4278 < R(s) < −0.0850;  −0.0221 < R(s) < 0;  R(s) > 0]
Question: Explain what you see in a qualitative manner!
Answer: reserved for the plenary sessions ; be there!
Theorem 27.2.2. For stationary preferences, there are only two ways to combine
rewards over time.
additive rewards: U ([s0 , s1 , s2 , . . .]) = R(s0 ) + R(s1 ) + R(s2 ) + · · ·
discounted rewards: U([s0, s1, s2, . . .]) = R(s0) + γR(s1) + γ²R(s2) + · · · where
γ is called the discount factor.
Possible Solutions:
1. Finite horizon: terminate utility computation at a fixed time T
Utility of States
Intuition: Utility of a state ≙ expected (discounted) sum of rewards (until
termination) assuming optimal actions.
Definition 27.2.5. Given a policy π, let st be the state the agent reaches at time
t starting at state s0. Then the expected utility obtained by executing π starting in
s is given by

U^π(s) := E[∑_{t=0}^{∞} γᵗ R(st)]

We define π*_s := argmax_π U^π(s).
Question: Why do we go left in (3, 1) and not up? (follow the utility)
expected sum of rewards = current reward + γ · exp. reward sum after best action
function VALUE-ITERATION(mdp, ε) returns a utility function
  inputs: mdp, an MDP with states S, actions A(s), transition model P(s′|s, a),
          rewards R(s), discount γ
          ε, the maximum error allowed in the utility of any state
  local variables: U, U′, vectors of utilities for states in S, initially zero
                   δ, the maximum change in the utility of any state in an iteration
  repeat
    U ← U′; δ ← 0
    for each state s in S do
      U′[s] ← R(s) + γ · max_{a∈A(s)} ∑_{s′} P(s′|s, a) · U[s′]
      if |U′[s] − U[s]| > δ then δ ← |U′[s] − U[s]|
  until δ < ε(1 − γ)/γ
  return U

Remark: Retrieve the optimal policy with π[s] := argmax_a (∑_{s′} U[s′] · P(s′|s, a))

Value Iteration Algorithm (Example)
[figure: (a) utility estimates for the states (4,3), (3,3), (1,1), (3,1), (4,1) as a function of the number of iterations; (b) number of iterations required (for c = 0.0001, 0.001, 0.01, 0.1) as a function of the discount factor]
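A compact value iteration sketch for a generic finite MDP (illustrative; the tiny two-state MDP at the bottom is made up just to exercise the code):

    def value_iteration(states, actions, P, R, gamma, eps=1e-4):
        """P[(s, a)] is a dict s_next -> probability, R[s] the reward, gamma the discount."""
        U = {s: 0.0 for s in states}
        while True:
            U_new, delta = {}, 0.0
            for s in states:
                best = max(sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in actions[s])
                U_new[s] = R[s] + gamma * best
                delta = max(delta, abs(U_new[s] - U[s]))
            U = U_new
            if delta < eps * (1 - gamma) / gamma:
                return U

    # Hypothetical two-state MDP: "stay" keeps the state, "go" moves to the other state.
    states = ["a", "b"]
    actions = {s: ["stay", "go"] for s in states}
    P = {("a", "stay"): {"a": 1.0}, ("a", "go"): {"b": 1.0},
         ("b", "stay"): {"b": 1.0}, ("b", "go"): {"a": 1.0}}
    R = {"a": 0.0, "b": 1.0}
    print(value_iteration(states, actions, P, R, gamma=0.9))   # U(b) ≈ 10, U(a) ≈ 9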
Policy Iteration
Recap: Value iteration computes utilities ⇝ optimal policy by MEU.
This even works if the utility estimate is inaccurate. (⇝ policy loss small)
Idea: Search for the optimal policy and utility values simultaneously [How60]: Iterate
policy evaluation: given policy πi, calculate Ui = U^{πi}, the utility of each state
were πi to be executed.
policy improvement: calculate a new MEU policy πi+1 using one-step lookahead.
Terminate if policy improvement yields no change in utilities.
Observation 27.3.8. Upon termination Ui is a fixpoint of Bellman update
; Solution to Bellman equation ; πi is an optimal policy.
Observation 27.3.9. Policy improvement improves policy and policy space is finite
; termination.
Policy Evaluation
Problem: How to implement the POLICY−EVALUATION algorithm?
Solution: To compute utilities given a fixed π: For all s we have

U(s) = R(s) + γ · (∑_{s′} U(s′) · P(s′|s, π(s)))
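Since π is fixed, these are n linear equations in the n unknowns U(s); a small sketch solving them directly (illustrative, reusing the two-state MDP from the value iteration sketch):

    import numpy as np

    states = ["a", "b"]
    P = {("a", "go"): {"b": 1.0}, ("b", "stay"): {"b": 1.0}}
    R = {"a": 0.0, "b": 1.0}
    policy = {"a": "go", "b": "stay"}
    gamma = 0.9

    # Solve (I - gamma * P_pi) U = R for U.
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    for s in states:
        for s2, prob in P[(s, policy[s])].items():
            A[idx[s], idx[s2]] -= gamma * prob
    U = np.linalg.solve(A, np.array([R[s] for s in states]))
    print(dict(zip(states, U)))   # -> {'a': 9.0, 'b': 10.0}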
Partial Observability
Definition 27.4.1. A partially observable MDP (a POMDP for short) is an MDP
together with a sensor model O that has the sensor Markov property and is stationary:
O(s, e) = P(e|s).
Problem: The agent does not know which state it is in ⇝ it makes no sense to talk
about a policy π(s)!
Theorem 27.4.3 (Astrom 1965). The optimal policy in a POMDP is a function
π(b) where b is the belief state (probability distribution over states).
Idea: Convert a POMDP into an MDP in belief state space, where T (b, a, b′ ) is
the probability that the new belief state is b′ given that the current belief state is b
and the agent does a. I.e., essentially a filtering update step.
For POMDPs, we also need to consider actions. (but the effect is the same)
If b(s) is the previous belief state and the agent does action a and then perceives e,
then the new belief state is

b′(s′) = α · P(e|s′) · (∑_{s} P(s′|s, a) · b(s))
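A direct transcription of this update (illustrative sketch; the transition and sensor models are passed in as dictionaries). This is exactly the FORWARD step used in the POMDP decision cycle below:

    def belief_update(b, a, e, states, P_trans, P_sense):
        """b: dict s -> belief; P_trans[(s, a)]: dict s' -> P(s'|s,a); P_sense[(e, s)]: P(e|s)."""
        b_new = {}
        for s_next in states:
            pred = sum(P_trans[(s, a)].get(s_next, 0.0) * b[s] for s in states)
            b_new[s_next] = P_sense[(e, s_next)] * pred
        z = sum(b_new.values())
        return {s: v / z for s, v in b_new.items()}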
Consequence: The optimal policy can be written as a function π ∗ (b) from belief
states to actions.
Definition 27.4.4. The POMDP decision cycle is to iterate over
1. Given the current belief state b, execute the action a = π ∗ (b)
2. Receive percept e.
3. Set the current belief state to FORWARD(b, a, e) and repeat.
Intuition: POMDP decision cycle is search in belief state space.
Example 27.4.6. The belief states of the 4×3 world form an 11-dimensional continuous
space. (11 states)
Theorem 27.4.7. Solving POMDPs is very hard! (actually, PSPACE hard)
Variables with known values are gray, the agent must choose a value for At.
Rewards for t = 0, . . ., t + 2, but utility for t + 3 (≙ discounted sum of the rest)
Part of the lookahead solution of the DDN above (search over the action tree):
circle ≙ chance nodes (the environment decides); triangle ≙ belief state (each
action decision is taken there)
Summary
Decision theoretic agents for sequential environments
Building on temporal, probabilistic models/inference (dynamic Bayesian networks)
The world is a POMDP with (initially) unknown transition and sensor models.
Machine Learning

This part introduces the foundations of machine learning methods in AI. We discuss the problem
of learning from observations in general, study inference-based techniques, and then go into
elementary statistical methods for learning.
Chapter 28: Learning from Observations
A Video Nugget covering the introduction to this chapter can be found at https://fau.tv/
clip/id/30369.
Outline
Learning agents
Inductive learning
Decision tree learning
Learning Element
Observation: The design of the learning element is dictated by
Preview:
Supervised learning: correct answers for each instance
Reinforcement learning: occasional rewards
Note:
1. Learning transition models is “supervised” if observable.
2. Supervised learning of correct actions requires “teacher”.
3. Reinforcement learning is harder, but requires no teacher.
Attribute-based Representations
Definition 28.3.1. In attribute based representations, examples are described by
attributes: (simple) functions on input samples, (think: pre-classifiers on examples)
their values, and (classify by attributes)
classifications. (Boolean, discrete, continuous, etc.)
Example 28.3.2 (In a Restaurant). Situations where I will/won’t wait for a table:
Attributes Target
Example Alt Bar F ri Hun P at P rice Rain Res T ype Est WillWait
X1 T F F T Some $$$ F T French 0–10 T
X2 T F F T Full $ F F Thai 30–60 F
X3 F T F F Some $ F F Burger 0–10 T
X4 T F T T Full $ F F Thai 10–30 T
X5 T F T F Full $$$ F T French >60 F
X6 F T F T Some $$ T T Italian 0–10 T
X7 F T F F None $ T F Burger 0–10 F
X8 F F F T Some $$ T T Thai 0–10 T
X9 F T T F Full $ T F Burger >60 F
X 10 T T T T Full $$$ F T Italian 10–30 F
X 11 F F F F None $ F F Thai 0–10 F
X 12 T T T T Full $ F F Burger 30–60 T
Decision Trees
Decision trees are one possible representation for hypotheses.
We evaluate the tree by going down the tree from the top, and always take the branch whose
attribute matches the situation; we will eventually end up with a Boolean value; the result. Using
the attribute values from X3 in Example 28.3.2 to descend through the tree in Example 28.3.4 we
indeed end up with the result “true”. Note that
1. some of the attributes in X3 are irrelevant.
2. the training set in Example 28.3.2 is realizable – i.e. the target is definable in the hypothesis class
of decision trees.
Expressiveness
Decision trees can express any function of the input attributes.
Example 28.3.6. For Boolean functions, truth table row ⇝ path to leaf:
Hypothesis Spaces
Question: How many distinct decision trees are there with n Boolean attributes?
Answer: reserved for the plenary sessions ; be there!
Question: How many purely conjunctive hypotheses? (e.g., Hungry ∧ ¬Rain)
Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) “all
positive” or “all negative”.
Example 28.3.8.
Attribute "Patrons?" is a better choice: it gives information about the classification.
Can we make this more formal? ; use information theory! (up next)
Information Entropy
Scale: 1 b ≙ 1 bit ≙ answer to a Boolean question with prior probability ⟨0.5, 0.5⟩.
Definition 28.4.1.
If the prior probability is ⟨P1, . . ., Pn⟩, then the information in an answer (also called
entropy of the prior) is

I(⟨P1, . . ., Pn⟩) := ∑_{i=1}^{n} −Pi · log₂(Pi)
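As an illustrative sketch (not from the notes), entropy and the resulting information gain of an attribute split, computed for the "Patrons?" attribute of the restaurant data above (None: 0+/2−, Some: 4+/0−, Full: 2+/4−):

    from math import log2

    def entropy(probs):
        return sum(-p * log2(p) for p in probs if p > 0)

    def gain(pos, neg, subsets):
        """subsets: list of (pos_k, neg_k) counts after splitting on the attribute."""
        before = entropy([pos / (pos + neg), neg / (pos + neg)])
        remainder = sum((p + n) / (pos + neg) * entropy([p / (p + n), n / (p + n)])
                        for p, n in subsets)
        return before - remainder

    # Restaurant data: 6 positive, 6 negative examples.
    print(round(gain(6, 6, [(0, 2), (4, 0), (2, 4)]), 3))   # -> 0.541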
Result: Substantially simpler than “true” tree – a more complex hypothesis isn’t
justified by small amount of data.
Performance measurement
Question: How do we know that h ≈ f ? (Hume’s Problem of Induction)
1. Use theorems of computational/statistical learning theory.
2. Try h on a new test set of examples. (use same distribution over example space
as training set)
Definition 28.4.10. The learning curve ≙ percentage correct on the test set as a
function of training set size.
Example 28.4.11. Restaurant data; graph averaged over 20 trials
For an attribute A with d values, compare the actual numbers pk and nk in each
subset sk with the expected numbers (expected if A is irrelevant)

p̂k = p · (pk + nk)/(p + n)   and   n̂k = n · (pk + nk)/(p + n).

Δ = ∑_{k=1}^{d} ((pk − p̂k)²/p̂k + (nk − n̂k)²/n̂k)
Idea: We think of examples (seen and unseen) as a sequence, and express the
“representativeness” as a stationarity assumption for the probability distribution.
Method: Each example before we see it is a random variable Ej , the observed
value ej = (xj ,yj ) samples its distribution.
Definition 28.5.1.
A sequence E1, . . ., En of random variables is independent and identically dis-
tributed (short: IID), iff they are
independent, i.e. P(Ej|Ej−1, Ej−2, . . .) = P(Ej) and
identically distributed, i.e. P(Ei) = P(Ej) for all i and j.
Example 28.5.2. A sequence of die tosses is IID. (fair or loaded does not matter)
Stationarity Assumption: We assume that the set E of examples is IID in the
future.
Definition 28.5.3. Given an inductive learning problem ⟨H, f ⟩, we define the error
rate of a hypothesis h∈H as the fraction of errors:
Caveat: A low error rate on the training set does not mean that a hypothesis
generalizes well.
Idea: Do not use homework questions in the exam.
Definition 28.5.4. The practice of splitting the data available for learning into
1. a training set from which the learning algorithm produces a hypothesis h and
2. a test set, which is used for evaluating h
is called holdout cross validation. (no peeking at test set allowed)
Model Selection
Definition 28.5.7. The model selection problem is to determine – given data – a
good hypothesis space.
Example 28.5.8. What is the best polynomial degree to fit the data
Idea: Solve the two parts together by iteration over “size”. (they inform each
other)
Problem: Need a notion of “size” ⇝ e.g. number of nodes in a decision tree.
Concrete Problem: Find the “size” that best balances overfitting and underfitting
to optimize test set accuracy.
[figure: validation set error and training set error as a function of tree size]
Stop when the training set error rate converges; choose the tree size that is optimal on the
validation curve. (here a tree with 7 nodes)
Generalization Loss
Note: L(y, y) = 0. (no loss if you are exactly correct)
Definition 28.5.14 (Popular general loss functions).
absolute value loss: L₁(y, ŷ) := |y − ŷ|               (small errors are good)
squared error loss:  L₂(y, ŷ) := (y − ŷ)²              (dito)
0/1 loss:            L₀/₁(y, ŷ) := 0 if y = ŷ, else 1  (error rate)
Idea: Maximize expected utility by choosing the hypothesis h that minimizes the expected
loss over all (x,y) ∈ f.
Definition 28.5.15. Let E be the set of all possible examples and P(X, Y) the
prior probability distribution over its components, then the expected generalization
loss for a hypothesis h with respect to a loss function L is

GenLoss_L(h) := ∑_{(x,y)∈E} L(y, h(x)) · P(x, y)
Empirical Loss
Problem: P(X, Y ) is unknown ; learner can only estimate generalization loss:
Regularization
Idea: Directly use empirical loss to solve model selection. (finding a good H)
Minimize the weighted sum of empirical loss and hypothesis complexity. (to avoid
overfitting).
Definition 28.5.17. Let λ∈R, h∈H, and E a set of examples, then we call
This works well in the limit, but for smaller problems there is a difficulty in that the
choice of encoding for the program affects the outcome.
e.g., how best to encode a decision tree as a bit string?
In recent years there has been more emphasis on large-scale learning. (millions of
examples)
PAC Learning
Basic idea of Computational Learning Theory:
Any hypothesis h that is seriously wrong will almost certainly be "found out"
with high probability after a small number of examples, because it will make an
incorrect prediction.
Thus, any h that is consistent with a sufficiently large set of training examples is unlikely
to be seriously wrong.
⇝ h is probably approximately correct.
Definition 28.6.1. Any learning algorithm that returns hypotheses that are prob-
ably approximately correct is called a PAC learning algorithm.
Derive performance bounds for PAC learning algorithms in general, using the
PAC Learning
Start with PAC theorems for Boolean functions, for which L0/1 is appropriate.
Definition 28.6.2. The error rate error(h) of a hypothesis h is the probability that
h misclassifies a new example.
error(h) := GenLoss_{L₀/₁}(h) = ∑_{(x,y)∈E} L₀/₁(y, h(x)) · P(x, y)
Sample Complexity
Let's compute the probability that hb ∈ Hb is consistent with the first N examples.
We know error(hb) > ε.
⇝ P(hb agrees with N examples) ≤ (1 − ε)^N. (independence)
⇝ P(Hb contains a consistent hypothesis) ≤ #(Hb) · (1 − ε)^N ≤ #(H) · (1 − ε)^N. (Hb ⊆ H)
⇝ To bound this by a small δ, show the algorithm N ≥ (1/ε) · (log₂(1/δ) + log₂(#(H))) examples.
Definition 28.6.4. The number of required examples as a function of ε and δ is
called the sample complexity of H.
Example 28.6.5. If H is the set of n-ary Boolean functions, then #(H) = 2^(2ⁿ).
⇝ the sample complexity grows with O(log₂(2^(2ⁿ))) = O(2ⁿ).
There are 2ⁿ possible examples,
⇝ PAC learning for Boolean functions needs to see (nearly) all examples.
[decision list: if Patrons(x, Some) then Yes; else if Patrons(x, Full) ∧ Fri/Sat(x) then Yes; else No]
Lemma 28.6.8. Given arbitrary size conditions, decision lists can represent arbi-
trary Boolean functions. (equivalent to
CNF)
This directly defeats our purpose of finding a “learnable subset” of H.
Plug this into the equation for the sample complexity: N ≥ (1/ε) · (log₂(1/δ) + log₂(|H|))
to obtain

N ≥ (1/ε) · (log₂(1/δ) + log₂(O(nᵏ · log₂(nᵏ))))
Intuitively: Any algorithm that returns a consistent decision list will PAC learn a
k-DL function in a reasonable number of examples, for small k.
1. find a test that agrees exactly with some subset E of the training set,
2. add it to the decision list under construction and remove E,
3. construct the remainder of the DL using just the remaining examples,
until there are no examples left.
[figure: learning curves (proportion correct on the test set vs. training set size) for decision tree and decision list learning on the restaurant data]
Example 28.7.6. [figure: examples of house price (in $1000) vs. house size, with a fitted linear hypothesis]
Idea: Minimize squared error loss over {(xi,yi) | i ≤ N} (used already by Gauss)

Loss(hw) = ∑_{j=1}^{N} L₂(yj, hw(xj)) = ∑_{j=1}^{N} (yj − hw(xj))² = ∑_{j=1}^{N} (yj − (w1xj + w0))²
Remark: Closed-form solutions only exist for linear regression, for other (dif-
ferentiable) hypothesis spaces use gradient descent methods for adjusting/learning
weights.
Note: it is convex. [figure: the loss surface over (w0, w1)]
Observation 28.7.8. The squared error loss function is convex for any linear
regression problem ⇝ there are no local minima.
The parameter α is called the learning rate. It can be a fixed constant or it can
decay as learning proceeds.
Definition 28.7.10.

w0 ← w0 − α · (∑ⱼ −2 · (yj − hw(xj)))        w1 ← w1 − α · (∑ⱼ −2 · (yj − hw(xj)) · xj)
These updates constitute the batch gradient descent learning rule for univariate
linear regression.
Convergence to the unique global loss minimum is guaranteed (as long as we pick
α small enough) but may be very slow.
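An illustrative sketch of this batch rule (not the notes' code; the data is made up, and the factor −2 is folded into the learning rate, as is common):

    def fit_univariate(xs, ys, alpha=0.01, epochs=5000):
        """Batch gradient descent for h_w(x) = w1*x + w0 under squared error loss."""
        w0, w1 = 0.0, 0.0
        n = len(xs)
        for _ in range(epochs):
            errs = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
            w0 += alpha * sum(errs) / n
            w1 += alpha * sum(e * x for e, x in zip(errs, xs)) / n
        return w0, w1

    # Made-up data lying exactly on y = 2x + 1.
    xs, ys = [0, 1, 2, 3, 4], [1, 3, 5, 7, 9]
    print(fit_univariate(xs, ys))   # -> approximately (1.0, 2.0)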
Gradient descent will reach the (unique) minimum of the loss function; the update
equation for each weight wi is

wi ← wi + α · (∑ⱼ xj,i · (yj − hw(x⃗j)))
[figure: seismic data in the (x1, x2) plane – white: earthquakes, black: underground explosions]
Also: h_{w*} as a decision boundary: x2 = 1.7·x1 − 4.9.
wi ←− wi + α · (y − hw (x)) · xi
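An illustrative perceptron learning sketch using this update rule with a hard threshold (the data below is made up and linearly separable):

    def perceptron_learn(data, alpha=0.1, epochs=100):
        """data: list of (x_vector, y) with y in {0, 1}; returns weights [w0, w1, ...]."""
        n = len(data[0][0])
        w = [0.0] * (n + 1)                       # w[0] is the bias weight (x_0 = 1)
        for _ in range(epochs):
            for x, y in data:
                xs = [1.0] + list(x)
                h = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
                w = [wi + alpha * (y - h) * xi for wi, xi in zip(w, xs)]
        return w

    # Made-up separable data: class 1 iff x1 + x2 > 1.
    data = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1), ((2, 1), 1), ((1, 2), 1)]
    print(perceptron_learn(data))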
Example 28.7.18. [figure: learning curves (plots of total training set accuracy vs. number of weight updates) for the perceptron rule on the earthquake/explosion data]
Logistic Regression
∂/∂wi (L₂(w)) = ∂/∂wi (y − hw(x))²
             = 2 · (y − hw(x)) · ∂/∂wi (y − hw(x))
             = −2 · (y − hw(x)) · l′(w·x) · ∂/∂wi (w·x)
             = −2 · (y − hw(x)) · l′(w·x) · xi
Definition 28.7.22. The rule for logistic update (weight update for minimizing the
loss) is
wi ←− wi + α · (y − hw (x)) · hw (x) · (1 − hw (x)) · xi
Outline
Brains
Neural networks
Perceptrons
Multilayer perceptrons
Applications of neural networks
Brains
Axiom 28.8.1 (Neuroscience Hypothesis). Mental activity consists primarily of
electrochemical activity in networks of brain cells called neurons.
One approach to Artificial Intelligence is to model and simulate brains. (and hope
that AI comes along naturally)
Definition 28.8.3. The AI sub field of neural networks (also called connectionism,
parallel distributed processing, and neural computation) studies computing systems
inspired by the biological neural networks that constitute brains.
Neural networks are attractive computational devices, since they perform important
AI tasks – most importantly learning and distributed, noise-tolerant computation –
naturally and efficiently.
ai ← g(ini) = g(∑ⱼ wj,i · aj)
[figure: a unit with input links, input function, activation function, output, and output links]
Theorem 28.8.6 (McCulloch and Pitts). Every Boolean function can be imple-
mented as McCulloch Pitts networks.
Proof: by construction
1. Recall that ai ← g(∑ⱼ wj,i · aj).
2. As for linear regression we use a0 = 1 ⇝ w0,i as a bias weight (or intercept). (determines the threshold)
3. [figure: threshold units implementing elementary Boolean functions]
4. Any Boolean function can be implemented as a DAG of McCulloch Pitts units.
Recurrent neural networks have directed cycles with delays ; have internal state
(like flip-flops), can oscillate etc.
Single-layer Perceptrons
[figure: a single-layer perceptron network (input layer, weights wi,j, output layer) and the output surface of a two-input perceptron unit]
Output units all operate separately, no shared weights ; treat as the combination
of n perceptron units.
Adjusting weights moves the location, orientation, and steepness of cliff.
a5 = g(w3,5 · a3 + w4,5 · a4 )
= g(w3,5 · g(w1,3 · a1 + w2,3 a2 ) + w4,5 · g(w1,4 · a1 + w2,4 a2 ))
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc., but not XOR (and thus no adders)
Represents a linear separator in input space:

∑ⱼ wj·xj > 0   or   w·x > 0
[figure: linear separability in input space – (a) x1 and x2, (b) x1 or x2 are separable; (c) x1 xor x2 is not]
Minsky & Papert (1969) pricked the first neural network balloon!
Perceptron Learning
Idea: Wlog. treat only single-output perceptrons ; w is a “weight vector”.
Learn by adjusting weights in w to reduce generalization loss on training set.
Let us compute with the squared error loss of a weight vector w for an example (x,y):

Loss(w) = Err² = (y − hw(x))²

and hence ∂Loss(w)/∂wj = −2 · Err · g′(inj) · xj.
Multilayer perceptrons
Definition 28.8.13. In multilayer perceptrons (MLPs), layers are usually fully
connected; the number of hidden units is typically chosen by hand.
[figure: input layer (activations ak), hidden layer (aj), and output layer (ai), connected by weights wk,j and wj,i]
Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers.
Observation: The squared error loss of a weight matrix w for an example (x,y) is

Loss(w) = ‖y − hw(x)‖₂² = ∑_{k=1}^{n} (yk − ak)²
Output layer: Analogous to that for single-layer perceptron, but multiple output
units
wj,i ← wj,i + α · aj · ∆i
where ∆i = Erri · g ′ (ini ) and Err = y − hw (x). (error vector)
Definition 28.8.14. The back propagation rule for hidden nodes of a multilayer
perceptron is
∆j ← g′(inj) · (∑i wj,i · ∆i)
wk,j ← wk,j + α · ak · ∆j
Back-Propagation Process
The back-propagation process can be summarized as follows:
1. Compute the ∆ values for the output units, using the observed error.
2. Starting with output layer, repeat the following for each layer in the network, until
the earliest hidden layer is reached:
(a) Propagate the ∆ values back to the previous (hidden) layer.
(b) Update the weights between the two layers.
Details (algorithm) later.
Compute the loss gradient w.r.t. the weights between the output and hidden layers; its negative gives the update wj,i ← wj,i + α · aj · ∆i from above.
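The following is a minimal sketch (not the algorithm from the course materials, just an illustration) of one such update step in Python for a single-hidden-layer network with sigmoid units; the ∆-notation follows the definitions above, everything else (sizes, data layout, learning rate) is assumed:

import math

def g(z):  return 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
def gp(z): return g(z) * (1.0 - g(z))          # its derivative g'

def backprop_step(W_kj, W_ji, x, y, alpha=0.5):
    """One update for a 1-hidden-layer MLP (weights are updated in place).
       W_kj[j][k]: input->hidden weights, W_ji[i][j]: hidden->output weights."""
    # forward pass
    in_j = [sum(W_kj[j][k] * x[k] for k in range(len(x))) for j in range(len(W_kj))]
    a_j  = [g(z) for z in in_j]
    in_i = [sum(W_ji[i][j] * a_j[j] for j in range(len(a_j))) for i in range(len(W_ji))]
    a_i  = [g(z) for z in in_i]
    # output layer deltas: Delta_i = Err_i * g'(in_i)
    delta_i = [(y[i] - a_i[i]) * gp(in_i[i]) for i in range(len(a_i))]
    # hidden layer deltas: Delta_j = g'(in_j) * sum_i w_{j,i} Delta_i
    delta_j = [gp(in_j[j]) * sum(W_ji[i][j] * delta_i[i] for i in range(len(delta_i)))
               for j in range(len(a_j))]
    # weight updates: w_{j,i} += alpha*a_j*Delta_i  and  w_{k,j} += alpha*x_k*Delta_j
    for i in range(len(W_ji)):
        for j in range(len(a_j)):
            W_ji[i][j] += alpha * a_j[j] * delta_i[i]
    for j in range(len(W_kj)):
        for k in range(len(x)):
            W_kj[j][k] += alpha * x[k] * delta_j[j]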
Back-Propagation – Properties
At each epoch, sum gradient updates for all examples and apply.
Training curve for 100 restaurant examples: finds exact fit.
[Plot: total error on the training set vs. number of epochs]
[Plot: proportion correct on test set vs. training set size]
Experience shows: MLPs are quite good for complex pattern recognition tasks,
but resulting hypotheses cannot be understood easily.
This makes MLPs ineligible for some tasks, such as credit card and loan approvals,
where law requires clear unbiased criteria.
Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)
Idea: Use gradient descent to search the space of all w and b for maximizing
combinations. (works, but SVMs follow a different route)
3. The weights αj associated with each data point are zero except at the support
vectors – the points closest to the separator.
There are good software packages for solving such quadratic programming
optimizations.
Once we have found an optimal vector α, use w = ∑j αj · xj.
[Figure: left: a two-dimensional input space (x1, x2); right: the same data mapped into the three-dimensional input space ⟨x1², x2², √2·x1·x2⟩ ; separable by a hyperplane]
Upshot: We map each input vector x to F(x) with f1 = x1², f2 = x2², and
f3 = √2·x1·x2.
Idea: Replace xj·xk by F(xj)·F(xk) in the SVM equation. (compute in the high-dimensional space)
Often we can compute F(xj)·F(xk) without computing F everywhere.
Example 28.9.5. If F(x) = ⟨x1², x2², √2·x1·x2⟩, then F(xj)·F(xk) = (xj·xk)²
(we have added the √2 in F so that this works)
We call the function (xj·xk)² a kernel function. (there are others; next)
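A two-line numerical check of Example 28.9.5 (the concrete points are arbitrary, chosen only for illustration):

import math

def F(x):                       # explicit feature map <x1^2, x2^2, sqrt(2)*x1*x2>
    return (x[0]**2, x[1]**2, math.sqrt(2) * x[0] * x[1])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
print(dot(F(x), F(z)))          # 1.0  (dot product in feature space)
print(dot(x, z) ** 2)           # 1.0  (kernel computed in the original space)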
For supervised learning, the aim is to find a simple hypothesis that is approximately
consistent with training examples
Decision tree learning using information gain.
Learning performance = prediction accuracy measured on test set
Statistical Learning
What kind of bag is it? What flavour will the next candy be?
[Plot: the posterior probabilities P(h1 | d), . . ., P(h5 | d) as a function of the number of observations in d]
if the observations are IID, i.e. P(d|hi) = ∏j P(dj|hi), and the hypothesis prior is
as advertised. (e.g. P(d|h3) = 0.5^10 ≈ 0.1%)
The posterior probabilities start with the hypothesis priors, change with data.
[Plot: the probability that the next candy is lime, as a function of the number of observations in d]
The posterior is P(hi|d) ∝ P(d|hi) · P(hi), where P(d|hi) is called the likelihood (of
the data under each hypothesis) and P(hi) the hypothesis prior.
Bayesian predictions use a likelihood-weighted average over the hypotheses:
P(X|d) = ∑i P(X|d, hi) · P(hi|d) = ∑i P(X|hi) · P(hi|d)
Definition 29.2.1. For maximum a posteriori learning (MAP learning) choose the
MAP hypothesis hMAP that maximizes P (hi |d).
I.e., maximize P (d|hi ) · P (hi ) or (even better) log2 (P (d|hi )) + log2 (P (hi )).
Predictions made according to a MAP hypothesis hMAP are approximately Bayesian
to the extent that P(X|d) ≈ P(X|hMAP ).
Example 29.2.2. In our candy example, hMAP = h5 after three limes in a row;
a MAP learner then predicts that candy 4 is lime with probability 1.
Compare this with the Bayesian prediction of 0.8. (see prediction curves above)
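These numbers can be reproduced directly; the sketch below assumes the usual candy-example priors ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩ and lime fractions ⟨0, 0.25, 0.5, 0.75, 1⟩ (an assumption, since the concrete values are not repeated in this extract):

priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # assumed P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # assumed P(lime | h_i)

def posterior(n_limes):
    """P(h_i | d) after observing n_limes limes in a row (IID observations)."""
    unnorm = [p * (l ** n_limes) for p, l in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

post = posterior(3)
print(max(range(5), key=lambda i: post[i]) + 1)     # 5   -> h_MAP = h5
print(sum(l * p for l, p in zip(p_lime, post)))     # ~0.80, the Bayesian prediction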
As more data arrive, the MAP and Bayesian predictions become closer, because the
competitors to the MAP hypothesis become less and less probable.
For deterministic hypotheses, P (d|hi ) is 1 if consistent, 0 otherwise
; MAP = simplest consistent hypothesis. (cf. science)
Remark: Finding MAP hypotheses is often much easier than Bayesian learning,
because it requires solving an optimization problem instead of a large summation
(or integration) problem.
Indeed, if a hypothesis predicts the data exactly – e.g. h5 in the candy example – then
log2(P(d|hi)) = log2(1) = 0 ; it is the preferred hypothesis.
This is more directly modeled by the following approximation to Bayesian learning:
Observation: For large data sets, the prior becomes irrelevant. (we might not
trust it anyways)
Idea: Use this to simplify learning.
Definition 29.2.4. Maximum likelihood learning (ML learning): choose the ML
hypothesis hML maximizing P (d|hi ). (simply get the best fit to the data)
Remark: ML learning = b MAP learning for a uniform prior. (reasonable if all
hypotheses are of the same complexity)
ML learning is the “standard” (non Bayesian) statistical learning method.
[Bayesian network: a single node Flavor with the parameter P(F = cherry) = θ]
Trick: When optimizing a product, optimize the logarithm instead! (log2 (!) is
monotone and turns products into sums)
Definition 29.3.3. The log likelihood is just the binary logarithm of the likelihood.
L(d|h):=log2 (P (d|h))
In English: hθ asserts that the actual proportion of cherries in the bag is equal to
the observed proportion in the candies unwrapped so far!
Seems sensible, but causes problems with 0 counts!
1. Write down an expression for the likelihood of the data as a function of the
parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero
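For the single-parameter candy bag this recipe can be carried out explicitly (writing c and ℓ for the observed numbers of cherry and lime candies; a standard calculation, included only to make the three steps concrete):

L(d|hθ) = log2(P(d|hθ)) = c · log2(θ) + ℓ · log2(1 − θ)
∂/∂θ L(d|hθ) = c/(θ · ln 2) − ℓ/((1 − θ) · ln 2) = 0  ;  θ = c/(c + ℓ)

i.e. the ML hypothesis asserts that the proportion of cherries in the bag equals the observed proportion – exactly the statement above.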
[Bayesian network: Flavor → Wrapper, with parameters P(F = cherry) = θ, P(W = red|F = cherry) = θ1, and P(W = red|F = lime) = θ2]
[Plot: the conditional density P(y|x) plotted over x and y]
Note: This kind of model is called “naive” since it is often used as a simplifying
model even if the effects are not conditionally independent after all.
[Plot: proportion correct on test set vs. training set size for decision tree learning and naive Bayes learning]
Naive Bayes learning scales well: with n Boolean attributes, there are just
2n + 1 parameters, and no search is required to find hML .
Naive Bayes learning systems have no difficulty with noisy or missing data and can
give probabilistic predictions when appropriate.
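To make the counting concrete, here is a minimal naive Bayes sketch in Python (the toy data, the add-one smoothing, and all names are illustrative assumptions, not taken from the course materials):

from collections import Counter

def train_naive_bayes(examples):
    """examples: list of (attributes, cls) with Boolean attributes.
       Returns P(cls) and P(attr_i = 1 | cls), estimated by counting
       with add-one smoothing (one common fix for 0 counts)."""
    classes = Counter(cls for _, cls in examples)
    n_attrs = len(examples[0][0])
    cond = {c: [1] * n_attrs for c in classes}          # add-one counts
    for attrs, cls in examples:
        for i, v in enumerate(attrs):
            cond[cls][i] += v
    prior = {c: classes[c] / len(examples) for c in classes}
    likeli = {c: [cond[c][i] / (classes[c] + 2) for i in range(n_attrs)] for c in classes}
    return prior, likeli

def predict(prior, likeli, attrs):
    def score(c):
        p = prior[c]
        for i, v in enumerate(attrs):
            p *= likeli[c][i] if v else 1 - likeli[c][i]
        return p
    return max(prior, key=score)

# toy usage
data = [([1, 1, 0], "yes"), ([1, 0, 0], "yes"), ([0, 1, 1], "no"), ([0, 0, 1], "no")]
prior, likeli = train_naive_bayes(data)
print(predict(prior, likeli, [1, 1, 0]))   # 'yes'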
Maximum likelihood learning assumes uniform prior, OK for large data sets:
1. Choose a parameterized family of models to describe the data.
; requires substantial insight and sometimes new models.
2. Write down the likelihood of the data as a function of the parameters.
; may require summing over hidden variables, i.e., inference.
3. Write down the derivative of the log likelihood w.r.t. each parameter.
4. Find the parameter values such that the derivatives are zero.
; may be hard/impossible; modern optimization techniques help.
Knowledge in Learning
Definition 30.1.3. Logic based inductive learning tries to learn a hypothesis h that
explains the classifications of the examples given their descriptions, i.e. h, D |= C
(the explanation constraint), where D is the conjunction of the example descriptions
and C the conjunction of the corresponding classifications.
For instance, the first example X1 from Example 28.3.2 can be described as
The classification is given by the goal predicate WillWait, in this case WillWait(X1)
or ¬WillWait(X1).
can be represented as
Method: Construct a disjunction of all the paths from the root to the positive
leaves interpreted as conjunctions of the attributes on the path.
Note: The equivalence takes care of positive and negative examples.
Cumulative Development
Example 30.1.6. Learning from very few examples using background knowledge:
1. Caveman Zog and the fish on a stick:
[Diagram: observations and prior knowledge feed into logic based inductive learning, which produces hypotheses and predictions]
Explanation-based Learning
Idea: Use explanation of success to infer a general rule.
Definition 30.1.9. Explanation based learning (EBL) refines the explanation con-
straint to the EBL constraints:
Relevance-based Learning
Idea: Use the prior knowledge to determine the relevance of a set of features to
the goal predicate. (reduce the hypothesis space to the relevant ones)
Example 30.1.10. In a given country most people speak the same language, but
do not have the same name:
Definition 30.1.11. Relevance based learning (RBL) refines the explanation con-
straint to the RBL constraints:
Deductive Learning
Definition 30.1.12. We call a procedure a deductive learning algorithm, if it makes
use of the observations, but does not produce hypotheses beyond the background
knowledge and the observations.
Definition 30.1.21. Knowledge based inductive learning (KBIL) replaces the ex-
planation constraint by the KBIL constraint:
Explanation-Based Learning
Intuition: EBL =
b Extracting general rules from individual observations.
Example 30.2.1. Differentiating and simplifying algebraic expressions
Rewr(1 × u, u)
Rewr(0 + u, u)
...
Simpl(1 × (0 + X), w)
ArithVar(X)
yes
ArithVar(z)
yes
Example 30.2.4. Take the leaves of the generalized proof tree to get the general
rule
Recap:
Use background knowledge to construct a proof for the example.
In parallel, construct a generalized proof tree.
New rule is the conjunction of the leaves of the proof tree and the variabilized
goal.
Drop conditions that are true regardless of the variables in the goal.
Example 30.2.5.
prim(z) ⇒ Simpl(1 × (0 + z), z)
Simpl(y + z, w) ⇒ Simpl(1 × (y + z), w)
Adding large numbers of rules to the knowledge base slows down the reasoning
process (increases the branching factor of the search space).
To compensate, the derived rules must offer significant speed increases.
Derived rules should be as general as possible to apply to the largest possible
set of cases.
Deductive learning: Makes use of the observations, but does not produce hypotheses
beyond the background knowledge and the observations.
So the observed example, together with the determination, entails
∀x Nationality(x, Brazil) ⇒ Language(x, Portuguese)
Special syntax: Nationality(x, n)≻Language(x, l)
Intuition: If we know the values of P and Q for one example x, e.g., P(x, a) and
Q(x, b), we can use the determination P≻Q to infer ∀y P(y, a) ⇒ Q(y, b).
Deriving Hypotheses
Given an algorithm for learning determinations, a learning agent has a way to
construct a minimal hypothesis within which to learn the target predicate.
Idea: Use decision tree learning for computing hypotheses.
Goal: Minimize size of hypotheses.
Result: Relevance based decision tree learning.
Exploiting Knowledge
RBDTL simultaneously learns and uses relevance information to minimize its hy-
pothesis space.
Declarative bias
How can prior knowledge be used to identify the appropriate hypothesis space
to search for the correct target definition?
Unanswered questions: (engineering needed to make ideas practical)
[Plot: proportion correct on test set vs. training set size for RBDTL and DTL]
30.4.1 An Example
A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/30396.
ILP: An example
General knowledge-based induction problem
[Figure: a family tree (George, Mum, . . . ) providing the example data]
Background knowledge
Observation: A little bit of background knowledge helps a lot.
Example 30.4.3. If the background knowledge contains
FOIL
function Foil(examples,target) returns a set of Horn clauses
  inputs: examples, set of examples
          target, a literal for the goal predicate
  local variables: clauses, set of clauses, initially empty
  while examples contains positive examples do
    clause := New−Clause(examples,target)
    remove examples covered by clause from examples
    add clause to clauses
  return clauses
FOIL
function New−Clause(examples,target) returns a Horn clause
  local variables: clause, a clause with target as head and an empty body
                   l, a literal to be added to the clause
                   extendedExamples, a set of examples with values for new variables
  extendedExamples := examples
  while extendedExamples contains negative examples do
    l := Choose−Literal(New−Literals(clause),extendedExamples)
    append l to the body of clause
    extendedExamples := map Extend−Example over extendedExamples
  return clause

function Extend−Example(example,literal) returns a new example
  if example satisfies literal
  then return the set of examples created by extending example with each
       possible constant value for each new variable in literal
  else return the empty set

function New−Literals(clause) returns a set of possibly ‘‘useful’’ literals
function Choose−Literal(literals) returns the ‘‘best’’ literal from literals
father(x, z) ⇒ grandfather(x, y)
Add literals using predicates
Negated or unnegated
Use any existing predicate (including the goal)
Arguments must be variables
Each literal must include at least one variable from an earlier literal or from the
head of the clause
Valid: Mother(z, u), Married(z, z), grandfather(v, x)
Invalid: Married(u, v)
Inverse Resolution
Inverse resolution in a nutshell:
Classifications follow from Background ∧ Hypothesis ∧ Descriptions.
This can be proven by resolution.
Run the proof backwards to find the hypothesis.
Problem: How to run the proof backwards?
Recap: In ordinary resolution we take two clauses C1 = L ∨ R1 and C2 = ¬L ∨ R2
and resolve them to produce the resolvent C = R1 ∨ R2 .
[Figure: an inverse resolution tree with substitutions [George/x], [Elisabeth/y] and [Anne/y], ending in True ⇒ False]
Can inverse resolution infer the law of gravity from examples of falling bodies?
Yes, given suitable background mathematics!
Monkey and typewriter problem: How to overcome the large branching factor and
the lack of structure in the search space?
Inventing new predicates is important to reduce the size of the definition of the
goal predicate.
Some of the deepest revolutions in science come from the invention of new predicates.
(e.g. Galileo's invention of acceleration)
Applications of ILP
ILP systems have outperformed knowledge free methods in a number of domains.
Molecular biology: the GOLEM system has been able to generate high-quality
predictions of protein structures and the therapeutic efficacy of various drugs.
Reinforcement Learning
Unsupervised Learning
So far: we have studied “learning from examples”. (functions, logical theories,
probability models)
Now: How can agents learn “what to do” in the absence of labeled examples of
“what to do”? We call this problem unsupervised learning.
Example 31.1.1 (Playing Chess). Learn transition models for own moves and
maybe predict opponent’s moves.
Problem: The agent needs to have some feedback about what is good/bad
; cannot decide “what to do” otherwise. (recall: external performance standard
for learning agents)
Example 31.1.2. The ultimate feedback in chess is whether you win, lose, or draw.
Definition 31.1.3. We call a learning situation where there are no labeled examples
unsupervised learning and the feedback involved a reward or reinforcement.
Example 31.1.4. In soccer, there are intermediate reinforcements in the shape of
goals, penalties, . . .
In MDPs, the agent has total knowledge about the environment and the reward
function, in reinforcement learning we do not assume this. (; POMDPs+rewards)
Example 31.1.6. You play a game without knowing the rules, and at some time
the opponent shouts you lose!
Passive Learning
Definition 31.2.1 (To keep things simple). Agent uses a state-based represen-
tation in a fully observable environment:
[Figure: the optimal policy π and the resulting utilities of the states in the 4 × 3 world, calculated with γ = 1 and R(s) = −0.04 for nonterminal states]
The agent executes a set of trials in the environment using its policy π.
In each trial, the agent starts in state (1,1) and experiences a sequence of state
transitions until it reaches one of the terminal states, (4,2) or (4,3).
Its percepts supply both the current state and the reward received in that state.
Passive Learning by Example
Example 31.2.3. Typical trials might look like this:
1. (1,1)−0.04 ; (1,2)−0.04 ; (1,3)−0.04 ; (1,2)−0.04 ; (1,3)−0.04 ; (2,3)−0.04 ;
   (3,3)−0.04 ; (4,3)+1
2. (1,1)−0.04 ; (1,2)−0.04 ; (1,3)−0.04 ; (2,3)−0.04 ; (3,3)−0.04 ; (3,2)−0.04 ;
   (3,3)−0.04 ; (4,3)+1
3. (1,1)−0.04 ; (2,1)−0.04 ; (3,1)−0.04 ; (3,2)−0.04 ; (4,2)−1
Definition 31.2.4. The utility is defined to be the expected sum of (discounted)
rewards obtained if policy π is followed:
U^π(s) := E(∑_{t=0}^{∞} γ^t · R(S_t))
where R(s) is the reward for a state, S_t (a random variable) is the state reached at
time t when executing policy π, and S_0 = s. (for 4 × 3 we take the discount factor γ = 1)
A simple method for direct utility estimation was invented in the late 1950s in the
area of adaptive control theory.
Definition 31.2.5. The utility of a state is the expected total reward from that
state onward (called the expected reward to go).
Idea: Each trial provides a sample of the reward to go for each state visited.
Example 31.2.6. The first trial in Example 31.2.3 provides a sample total reward
of 0.72 for state (1,1), two samples of 0.76 and 0.84 for (1,2), two samples of 0.80
and 0.88 for (1,3), . . .
Definition 31.2.7. The direct utility estimation algorithm cycles over trials, cal-
culates the reward to go for each state, and updates the estimated utility for that
state by keeping a running average for each state in a table.
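A minimal sketch of this algorithm applied to the three trials of Example 31.2.3 (rewards −0.04 as in the 4 × 3 world; the code itself is illustrative only):

from collections import defaultdict

trials = [   # each entry: (state, reward received in that state)
    [((1,1),-0.04),((1,2),-0.04),((1,3),-0.04),((1,2),-0.04),((1,3),-0.04),
     ((2,3),-0.04),((3,3),-0.04),((4,3),1.0)],
    [((1,1),-0.04),((1,2),-0.04),((1,3),-0.04),((2,3),-0.04),((3,3),-0.04),
     ((3,2),-0.04),((3,3),-0.04),((4,3),1.0)],
    [((1,1),-0.04),((2,1),-0.04),((3,1),-0.04),((3,2),-0.04),((4,2),-1.0)],
]

sums, counts = defaultdict(float), defaultdict(int)
for trial in trials:
    for i, (state, _) in enumerate(trial):
        reward_to_go = sum(r for _, r in trial[i:])   # one sample of the reward to go
        sums[state] += reward_to_go
        counts[state] += 1

U = {s: sums[s] / counts[s] for s in sums}            # running averages
print(round(U[(1,1)], 3))   # (1,1) gets samples 0.72, 0.72, -1.16 -> about 0.093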
Observation 31.2.8. In the limit, the sample average will converge to the true
expectation (utility) from Definition 31.2.4.
Remark 31.2.9. Direct utility estimation is just supervised learning, where each
example has the state as input and the observed reward to go as output.
Upshot: We have reduced reinforcement learning to an inductive learning problem.
But direct utility estimation learns nothing until the end of the trial.
Intuition: Direct utility estimation searches for U in a hypothesis space that is too
large ⇝ many functions that violate the Bellman equations.
Example 31.2.16 (Passive ADP learning curves for the 4x3 world).
Given the optimal policy from Example 31.2.2
Note the large changes occurring around the 78th trial – this is the first time that
the agent falls into the -1 terminal state at (4,2).
Observation 31.2.17. The ADP agent is limited only by its ability to learn the
transition model. (intractable for large state spaces)
Idea: Adapt the passive ADP algorithm to handle this new freedom.
learn a complete model with outcome probabilities for all actions, rather than
just the model for the fixed policy. (use PASSIVE-ADP-AGENT)
choose actions; the utilities to learn are defined by the optimal policy and obey the
Bellman equation:
U(s) = R(s) + γ · max_{a∈A(s)} (∑_{s′} U(s′) · P(s′|s, a))
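The ADP agent itself is not spelled out in this extract, but once a transition model P(s′|s, a) and rewards R have been learned, the Bellman equation can be solved, e.g. by iterating it to a fixed point (value iteration); a generic sketch (all names and the toy usage below are assumptions for illustration, not the course's implementation):

def value_iteration(states, actions, P, R, gamma=1.0, eps=1e-6):
    """P[(s, a)]: dict {s2: probability}, R[s]: reward; returns utilities U(s).
       Terminal states are those with no applicable actions."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions(s):                        # terminal state
                new_u = R[s]
            else:
                new_u = R[s] + gamma * max(
                    sum(p * U[s2] for s2, p in P[(s, a)].items())
                    for a in actions(s))
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < eps:
            return U

# toy usage: action "go" moves from s0 to the terminal state s1
states = ["s0", "s1"]
P = {("s0", "go"): {"s1": 1.0}}
R = {"s0": -0.04, "s1": 1.0}
print(value_iteration(states, lambda s: ["go"] if s == "s0" else [], P, R))
# {'s0': 0.96, 's1': 1.0}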
Example 31.3.1 (Greedy ADP learning curves for the 4x3 world).
The agent follows the optimal policy for the learned model at each step.
It does not learn the true utilities or the true optimal policy!
instead, in the 39th trial, it finds a policy that reaches the +1 reward along the
lower route via (2,1), (3,1), (3,2), and (3,3).
After experimenting with minor variations, from the 276th trial onward it sticks
to that policy, never learning the utilities of the other states and never finding
the optimal route via (1,2), (1,3), and (2,3).
What can be done? The agent does not know the true environment.
Idea: Actions do more than provide rewards according to the learned model;
they also contribute to learning the true model by affecting the percepts received.
By improving the model, the agent may reap greater rewards in the future.
Observation 31.3.3. An agent must make a tradeoff between
exploitation to maximize its reward as reflected in its current utility estimates
and
exploration to maximize its long term well-being.
Pure exploitation risks getting stuck in a rut. Pure exploration to improve one’s
knowledge is of no use if one never puts that knowledge into practice.
Compare with the information gathering agent from section 25.6.
Communication
In other words: the language you use all day long, e.g. English, German, . . .
Why should we care about natural language?:
Even more so than thinking, language is a skill that only humans have.
It is a miracle that we can express complex thoughts in a sentence in a matter
of seconds.
It is no less miraculous that a child can learn tens of thousands of words and a
complex grammar in a matter of a few years.
Language Technology
Language Assistance:
written language: Spell/grammar/style-checking,
spoken language: dictation systems and screen readers,
multilingual text: machine-supported text and dialog translation, eLearning.
Information management:
Psychology/Cognition: Semantics =
b “what is in our brains” (; mental models)
Mathematics has driven much of modern logic in the quest for foundations.
Logic as “foundation of mathematics” solved as far as possible
In daily practice syntax and semantics are not differentiated (much).
A good probe into the issues involved in natural language understanding is to look at trans-
lations between natural language utterances – a task that arguably involves understanding the
utterances first.
Example 32.2.2. Wirf der Kuh das Heu über den Zaun. ̸;Throw the cow the
hay over the fence. (differing grammar; Google Translate)
Example 32.2.3. Grammar is not the only problem:
Der Geist ist willig, aber das Fleisch ist schwach! (The spirit is willing, but the flesh is weak!)
Der Schnaps ist gut, aber der Braten ist verkocht! (The liquor is good, but the roast is overcooked!)
Observation 32.2.4. We have to understand the meaning for high-quality trans-
lation!
If it is indeed the meaning of natural language, we should look further into how the form of the
utterances and their meaning interact.
Newspaper ;
For questions/answers, it would be very useful to find out what words (sentences/-
texts) mean.
Interpretation of natural language utterances: three problems
Let us support the last claim with a couple of initial examples. We will come back to these phenomena
again and again over the course of the course and study them in detail.
But there are other phenomena that we need to take into account when computing the meaning of
NL utterances.
[Diagram: an utterance is mapped – using grammar and lexicon – to the meaning of the utterance, from which inference and world knowledge extract the relevant information]
We will look at another example that shows that the situation with pragmatic analysis is even
more complex than we thought. Understanding this is one of the prime objectives of the AI-2
lecture.
[Diagram: refined pipeline – an utterance is mapped via the grammar to its semantic potential, then to an utterance-specific meaning, from which inference extracts the relevant information]
Example 32.2.12 is also a very good example for the claim in Observation 32.2.4 that even for high-
quality (machine) translation we need semantics. We end this very high-level introduction with
a caveat.
Demo:
DBPedia http://dbpedia.org/snorql/
Query: Soccer players, who are born in a country with more than 10 million in-
habitants, who played as goalkeeper for a club that has a stadium with more than
30.000 seats and the club country is different from the birth country
Even if we could get a perfect grasp of the semantics (aka. meaning) of NL utterances, their structure,
and their context dependency – we will try this in this lecture, but of course fail, since the issues are
much too involved and complex for just one lecture – we still could not account for all the
human mind does with language. But there is hope: for limited and well-understood domains,
we can do amazing things. This is what this course tries to show, both in theory as well as in
practice.
Logical analysis vs. conceptual analysis: These examples – mostly borrowed from David-
son [Dav67] – help us to see the difference between “logical analysis” and “conceptual analysis”.
We observed that from This is a big diamond. we cannot conclude This is big. Now consider the
sentence Jane is a beautiful dancer. Similarly, it does not follow from this that Jane is beautiful,
but only that she dances beautifully. Now, what it is to be beautiful or to be a beautiful dancer
is a complicated matter. To say what these things are is a problem of conceptual analysis. The
job of semantics is to uncover the logical form of these sentences. Semantics should tell us that
the two sentences have the same logical forms; and ensure that these logical forms make the right
predictions about the entailments and truth conditions of the sentences, specifically, that they
don’t entail that the object is big or that Jane is beautiful. But our semantics should provide a
distinct logical form for sentences of the type: This is a fake diamond. From which it follows that
the thing is fake, but not that it is a diamond.
One way to think about the examples of ambiguity on the previous slide is that they illustrate a
certain kind of indeterminacy in sentence meaning. But really what is indeterminate here is what
sentence is represented by the physical realization (the written sentence or the phonetic string).
The symbol duck just happens to be associated with two different things, the noun and the verb.
Figuring out how to interpret the sentence is a matter of deciding which item to select. Similarly
for the syntactic ambiguity represented by PP attachment. Once you, as interpreter, have selected
one of the options, the interpretation is actually fixed. (This doesn’t mean, by the way, that as
an interpreter you necessarily do select a particular one of the options, just that you can.) A
brief digression: Notice that this discussion is in part a discussion about compositionality, and
gives us an idea of what a non-compositional account of meaning could look like. The Radical
Pragmatic View is a non-compositional view: it allows the information content of a sentence to
be fixed by something that has no linguistic reflex.
To help clarify what is meant by compositionality, let me just mention a couple of other ways
in which a semantic account could fail to be compositional.
• Suppose your syntactic theory tells you that S has the structure [a[bc]] but your semantics
computes the meaning of S by first combining the meanings of a and b and then combining the
result with the meaning of c. This is non-compositional.
Sentence 1. entails that George was late; sentence 2. doesn’t. We might try to account for
this by saying that in the environment of the verb believe, a clause doesn’t mean what it
usually means, but something else instead. Then the clause that George was late is assumed
to contribute different things to the informational content of different sentences. This is a
non-compositional account.
Example 32.3.4. Every man loves a woman. (Keira Knightley or his mother!)
Example 32.3.5. Every car has a radio. (only one reading!)
Example 32.3.6. Some student in every course sleeps in every class at least
some of the time. (how many readings?)
Example 32.3.7. The president of the US is having an affair with an intern.
(2002 or 2000?)
Example 32.3.8. Everyone is here. (who is everyone?)
Observation: If we look at the first sentence, then we see that it has two readings:
1. there is one particular woman whom every man loves, and
2. for each man there is one woman whom that man loves.
These correspond to distinct situations (or possible worlds) that make the sentence true.
Observation: For the second example we only get one reading: the analogue of 2. The reason
for this lies not in the logical structure of the sentence, but in concepts involved. We interpret
the meaning of the word has as the relation “has as physical part”, which in our world carries a
certain uniqueness condition: If a is a physical part of b, then it cannot be a physical part of c,
unless b is a physical part of c or vice versa. This makes the structurally possible analogue to 1.
impossible in our world and we discard it.
Observation: In the examples above, we have seen that (in the worst case) we can have one
reading for every ordering of the quantificational phrases in the sentence. So, in the third example,
where we have four of them, we would get 4! = 24 readings. It should be clear from introspection that
we (humans) do not entertain 24 readings when we understand and process this sentence. Our
models should account for such effects as well.
Context and Interpretation: It appears that the last two sentences have different informational
content on different occasions of use. Suppose I say Everyone is here. at the beginning of class.
Then I mean that everyone who is meant to be in the class is here. Suppose I say it later in the
day at a meeting; then I mean that everyone who is meant to be at the meeting is here. What
shall we say about this? Here are three different kinds of solution:
Radical Semantic View On every occasion of use, the sentence literally means that everyone
in the world is here, and so is strictly speaking false. An interpreter recognizes that the speaker
has said something false, and uses general principles to figure out what the speaker actually
meant.
Radical Pragmatic View What the semantics provides is in some sense incomplete. What the
sentence means is determined in part by the context of utterance and the speaker’s intentions.
The differences in meaning are entirely due to extra-linguistic facts which have no linguistic
reflex.
The Intermediate View The logical form of sentences with the quantifier every contains a slot
for information which is contributed by the context. So extra-linguistic information is required
to fix the meaning; but the contribution of this information is mediated by linguistic form.
John loves his wife. Peter does too. (whom does Peter love?)
John loves golf, and Mary too. (who does what?)
Remark 32.4.2. Natural languages like English, German, or Spanish are not.
Example 32.4.3. Let us look at concrete examples
Not to be invited is sad! (definitely English)
To not be invited is sad! (controversial)
Definition 32.4.5. A text corpus (or simply corpus; plural corpora) is a large and
structured collection of natural language texts.
Definition 32.4.6. In corpus linguistics, corpora are used to do statistical analysis
and hypothesis testing, checking occurrences or validating linguistic rules within a
specific natural language.
As for Markov processes, we write P(c1:N) for the probability of a character sequence
c1 . . . cN of length N.
Definition 32.4.7. We call a character sequence of length n an n-gram (unigram,
bigram, trigram for n = 1, 2, 3).
With the chain rule and then the Markov property, we obtain
P(c1:N) = ∏_{i=1}^{N} P(ci | c1:i−1) = ∏_{i=1}^{N} P(ci | ci−2:i−1)
Thus, a trigram model for a language with 100 characters, P(ci |ci−2:i−1 ) has
1.000.000 entries. It can be estimated from a corpus with 107 characters.
One approach: Build a trigram language model P(ci |ci−2:i−1 , ℓ) for each candi-
date language ℓ by counting trigrams in a ℓ-corpus.
Apply Bayes’ rule and the Markov property to get the most likely language:
ℓ∗ = argmax_ℓ (P(ℓ | c1:N))
   = argmax_ℓ (P(ℓ) · P(c1:N | ℓ))
   = argmax_ℓ (P(ℓ) · ∏_{i=1}^{N} P(ci | ci−2:i−1, ℓ))
The prior probability P(ℓ) can be estimated; it is not a critical factor, since the
trigram language models are extremely sensitive.
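A toy version of this scheme (tiny made-up “corpora”, add-one smoothing, an assumed alphabet size, and a uniform prior P(ℓ) – all of these are illustrative assumptions):

import math
from collections import Counter

def trigram_model(corpus):
    """Counts for P(c_i | c_{i-2:i-1}): trigram counts and bigram context counts."""
    tri = Counter(corpus[i:i+3] for i in range(len(corpus) - 2))
    bi  = Counter(corpus[i:i+2] for i in range(len(corpus) - 2))
    return tri, bi

def log_likelihood(text, model, alpha=1.0, alphabet=30):
    """log P(text | model), with add-one smoothing over an assumed alphabet size."""
    tri, bi = model
    s = 0.0
    for i in range(len(text) - 2):
        ctx, c3 = text[i:i+2], text[i:i+3]
        s += math.log((tri[c3] + alpha) / (bi[ctx] + alpha * alphabet))
    return s

corpora = {"de": "der die das und ist nicht mit den von zu im auch",
           "en": "the and is not with of to in that it was for on"}
models = {lang: trigram_model(c) for lang, c in corpora.items()}

query = "is that the one"
# maximum likelihood language (uniform prior P(l) assumed)
print(max(models, key=lambda lang: log_likelihood(query, models[lang])))   # 'en'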
recognize that Mr. Sopersteen is the name of a person and aciphex is the name of
a drug.
Remark 32.4.16. Character-level language models are good for this task because
they can associate the character sequence ex with a drug name and steen with a
person name, and thereby identify words that they have never seen before.
1. There are many more words than characters. (100 vs. 10^5 in English)
2. And what is a word anyways? (space/punctuation-delimited substrings?)
3. Data sparsity: we do not have enough data! For a language model for 10^5
words in English, there are 10^15 trigrams.
4. Most training corpora do not have all words.
This trick can be refined: if we have a word classifier, we can use a new token per class,
e.g. <EMAIL> or <NUM>.
Clearly there are differences; how can we measure them to evaluate the models?
Definition 32.4.20. The perplexity of a sequence c1:N is defined as
Perplexity(c1:N) := P(c1:N)^(−1/N)
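For intuition: a model that assigns uniform probability 1/k to each of k characters has perplexity k; a two-line check (the numbers are arbitrary):

k, N = 100, 50
P_seq = (1.0 / k) ** N           # probability of any length-N sequence under the model
print(P_seq ** (-1.0 / N))       # 100.0 -> perplexity equals the branching factor k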
Information Retrieval
Definition 32.5.1. Information retrieval (IR) deals with the representation, orga-
nization, storage, and maintenance of information objects so that it provides the
user with easy access to the relevant information and satisfies the user’s various
information needs.
Example 32.5.3. Google and Bing are web search engines, their query is a bag
of words and documents are web pages, PDFs, images, videos, shopping portals.
Definition 32.5.5.
A multiset of words in V = {t1, . . ., tn} is called a bag of words (BOW), and can be
represented as a word frequency vector in N^|V|: the vector of raw word frequencies.
Example 32.5.6.
If we have two documents d1 = Have a good day! and d2 = Have a great day!,
then we can use V = {Have, a, good, great, day} and can represent good as ⟨0, 0, 1, 0, 0⟩,
great as ⟨0, 0, 0, 1, 0⟩, and d1 as ⟨1, 1, 1, 0, 1⟩.
Words outside the vocabulary are ignored in the BOW approach. So the document
d3 = What a day, a good day is represented as ⟨0, 2, 1, 0, 2⟩.
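The vectors of Example 32.5.6 can be computed mechanically; a small sketch (the tokenization rule is a simplifying assumption):

def bow(document, vocabulary):
    """Raw word frequency vector of a document over a fixed vocabulary."""
    words = [w.strip("!?.,").lower() for w in document.split()]
    return [sum(1 for w in words if w == t.lower()) for t in vocabulary]

V = ["Have", "a", "good", "great", "day"]
print(bow("Have a good day!", V))          # [1, 1, 1, 0, 1]
print(bow("What a day, a good day", V))    # [0, 2, 1, 0, 2]  ('What' is ignored)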
[Figure: documents D1 = (t1,1, t1,2, t1,3) and D2 = (t2,1, t2,2, t2,3) as vectors in a three-dimensional term space]
Idea: Introduce a weighting factor for the word frequency vector that de-emphasizes
the dimension of the more (globally) frequent words.
We need to normalize the word frequency vectors first:
Definition 32.5.10. Given a document d and a vocabulary word t∈V , the normal-
ized term frequency (also usually called just term frequency) tf(t, d) is the raw term
frequency divided by |d|.
Definition 32.5.11.
Given a document collection D = {d1, . . ., dN} and a word t, the inverse document
frequency is given by idf(t, D) := log10(N / |{d∈D | t∈d}|).
TF-IDF Example
Let D:={d1 , d2 } be a document corpus over the vocabulary
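The example is truncated in this extract; the following sketch therefore uses a made-up two-document corpus and only illustrates how tf and idf from the two definitions above combine:

import math

def tf(t, d):                 # normalized term frequency (Definition 32.5.10)
    return d.count(t) / len(d)

def idf(t, D):                # inverse document frequency (Definition 32.5.11)
    return math.log10(len(D) / sum(1 for d in D if t in d))

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

# made-up corpus of two documents, given as token lists
d1 = ["have", "a", "good", "day"]
d2 = ["have", "a", "great", "day"]
D = [d1, d2]
print(tfidf("good", d1, D))   # 0.25 * log10(2/1) ~ 0.075
print(tfidf("day", d1, D))    # 0.25 * log10(2/2) = 0.0  (occurs in every document)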
Once an answer set has been determined, the results have to be sorted, so that they can be
presented to the user. As the user has a limited attention span – users will look at most at three
to eight results before refining a query – it is important to rank the results, so that the hits that
contain information relevant to the user's information need appear early. This is a very difficult problem,
as it involves guessing the intentions and information context of users, to which the search engine
has no access.
Problem: There are many hits, and we need to sort them (e.g. by importance).
Idea: A web site is important . . . if many others hyperlink to it.
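This link-based importance idea is commonly realized by the PageRank algorithm; a generic power-iteration sketch (damping factor, iteration count, and the toy link graph are illustrative assumptions, not the course's formulation):

def pagerank(links, damping=0.85, iterations=50):
    """links[p]: list of pages that p links to; returns an importance score per page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links[p] or pages          # a sink distributes its rank evenly
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank

# tiny illustrative web: C is linked to by A and B, so it ends up most important
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))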
Getting the ranking right is a determining factor for the success of a search engine. In fact, the early
success of Google was based on the PageRank algorithm discussed above (and the fact that they figured
out a revenue stream using text ads to monetize searches).
Word Embeddings
Definition 32.6.2. A vector is called one hot, iff all components are 0 except for
one 1. We call a word embedding one hot, iff all of its vectors are.
Example 32.6.3 (Vector Space Methods in Information Retrieval).
Word frequency vectors are induced by adding up one hot word embeddings.
Example 32.6.4. Given a document corpus D – the context – the tf-idf word
embedding is given by e : t ↦ ⟨tfidf(t, d1, D), . . ., tfidf(t, d#(D), D)⟩.
Intuition behind these two: Words that occur in similar documents are similar.
Example 32.6.5. For the text “watch movies rather than read books” with
context size 2 and target rather, we have
context: C := {watch, movies, than, read}
vocabulary: V := {watch, movies, rather, than, read, books}
The hidden layer neurons just copy the weighted sum of inputs to the next layer
(no threshold)
The output layer computes the softmax of the hidden nodes
Outline
Communication
Grammars and syntactic analysis
Problems (real Language Phenomena)
Communication
“Classical” view (pre-1953):
Language consists of sentences that are true/false. (cf. logic)
“Modern” view (post-1953):
Language is a form of action!
Speech Acts
Stages in communication (informing): even here, the situation is complex
33.2 Grammar
A Video Nugget covering this section can be found at https://fau.tv/clip/id/35581.
We fortify our intuition about these – admittedly very abstract – constructions by an example
and introduce some more vocabulary.
S → NP ; Vi
NP → Article; N
Article → the | a | an
N → dog | teacher | . . .
Vi → sleeps | smells | . . .
Definition 33.2.4. The non-lexicon grammar rules are called structural, and the
nonterminals in the heads are called phrasal categories.
Context-Free Parsing
Recall: The sentences accepted by a grammar are defined “top-down” as those
the start symbol can be rewritten into.
Definition 33.2.5. Bottom up parsing works by replacing any substring that
matches the right hand side of a grammar rule by the head of that rule.
[Parse tree for “I shoot the Wumpus”:]
[S[N P [P ronoun I]][V P [T ransV erb shoot][N P [Article the][N oun Wumpus]]]]
Grammaticality Judgements
Problem: The formal language LG accepted by a grammar G may differ from the
natural language Ln it supposedly models.
Definition 33.2.8. We say that a grammar G over generates, iff it accepts strings
outside of Ln (false positives) and under generates, iff there are Ln strings (false
negatives) that LG does not accept.
Wumpus grammar
S → NP ; VP [.9] I + feel a breeze
| S ; Conjunction; S [.1] I feel a breeze + and + I smell a wumpus
NP → Pronoun [.3] I
| Name [.1] John
| Noun [.1] pits
| Article; Noun [.25] the + wumpus
| Article; Adjs; Noun [.05] the + smelly dead + wumpus
| Digit; Digit [.05] 34
| NP ; PP [.1] the wumpus + in 1 3
| NP ; RelClause [.05] the wumpus + that is smelly
VP → Verb [.25] stinks
| TransVerb; NP [.25] see + the Wumpus
| VP ; NP [.25] feel + a breeze
| VP ; Adjective [.05] is + smelly
| VP ; PP [.1] turn + to the east
| VP ; Adverb [.1] go + ahead
Adjs → Adjective [.8] smelly
| Adjective; Adjs [.2] smelly + dead
PP → Prep; NP [1] to + the east
RelClause → that; VP [1] that + is smelly
PCFG Parsing
Example 33.2.12. Reconsidering Example 33.2.6 with the Wumpus grammar
above, we get the PCFG parse tree:
[S .9 [NP .3 [Pronoun .1 I]] [VP .25 [TransVerb .1 shoot] [NP .25 [Article .4 the] [Noun .15 Wumpus]]]]
Note: two S-rooted subtrees, one with NP−SBJ−2 child and one with NP SBJ.
Ambiguity
Squad helps dog bite victim.
Helicopter powered by human flies.
American pushes bottle up Germans.
I ate spaghetti with meatballs.
salad.
abandon.
a fork.
a friend.
Anaphora
Using pronouns to refer back to entities already introduced in the text
After Mary proposed to John, they found a preacher and got married.
For the honeymoon, they went to Hawaii.
Mary saw a ring through the window and asked John for it.
Mary threw a rock at the window and broke it.
Indexicality
Metonymy
Using one noun phrase to stand for another
Metaphor
“Non-literal” usage of words and phrases, often systematic:
I’ve tried killing the process but it won’t die. Its parent keeps it alive.
Noncompositionality
basketball shoes
baby shoes
alligator shoes
designer shoes
brake shoes
Noncompositionality
small moon
mere child
alleged murderer
real leather
artificial grass
Planning
Planning Frameworks
Planning Algorithms
Planning and Acting in the real world
[Figure: an agent interacting with its environment through sensors (percepts) and actuators (actions) – AIMA Figure 2.1]
Simple Reflex Agents
[Figure: schematic diagram of a simple reflex agent – condition action rules map the percept-derived description of the current world state to an action (AIMA Figure 2.9)]
function Simple−Reflex−Agent(percept) returns an action
  persistent: rules, a set of condition–action rules
  state ← Interpret−Input(percept)
  rule ← Rule−Match(state,rules)
  action ← rule.Action
  return action
Reflex Agents with State
[Figure: a model based reflex agent – it keeps track of the current state of the world using an internal model and then chooses an action in the same way as the reflex agent (AIMA Figure 2.12)]
state ← Update−State(state,action,percept,model)
rule ← Rule−Match(state,rules)
action ← rule.Action
return action
[Figure: a general learning agent – a performance element selecting actions, a learning element improving it using feedback from a critic (relative to a performance standard), and a problem generator suggesting exploratory actions (AIMA Figure 2.15)]
Rational Agent
Idea: Try to design agents that are successful. (do the right thing)
Definition 34.0.1. An agent is called rational, if it chooses whichever action max-
imizes the expected value of the performance measure given the percept sequence
to date. This is called the MEU principle.
Note: A rational agent need not be perfect:
  it only needs to maximize expected value (rational ̸= omniscient)
    it need not predict e.g. very unlikely but catastrophic events in the future
  percepts may not supply all relevant information (rational ̸= clairvoyant)
    if we cannot perceive things we do not need to react to them.
    but we may need to try to find out about hidden dangers (exploration)
  action outcomes may not be as expected (rational ̸= successful)
    but we may need to take action to ensure that they do (more often) (learning)
Rational ; exploration, learning, autonomy
System 1 can
  see whether an object is near or far
  complete the phrase “war and . . . ”
  display disgust when seeing a gruesome image
  solve 2 + 2 = ?
  read text on a billboard
System 2 can
  look out for the woman with the grey hair
  sustain a higher than normal walking rate
  count the number of A’s in a certain text
  give someone your phone number
  park into a tight parking space
  solve 17 × 24
  determine the validity of a complex argument
[Diagram: System 1 and System 2 interact (“critically guides”, “massively extends reach”, “unintentionally influences”); System 1 corresponds to subsymbolic AI, System 2 to symbolic AI]
Uncertainty
Probabilistic Reasoning
Making Decisions in Episodic Environments
Problem Solving in Sequential Environments
[Aus62] John Langshaw Austin. How to do things with words. William James Lectures. Oxford
University Press, 1962.
[Bac00] Fahiem Bacchus. Subset of PDDL for the AIPS2000 Planning Competition. The AIPS-
00 Planning Competition Comitee. 2000.
[BF95] Avrim L. Blum and Merrick L. Furst. “Fast planning through planning graph analysis”.
In: Proceedings of the 14th International Joint Conference on Artificial Intelligence
(IJCAI). Ed. by Chris S. Mellish. Montreal, Canada: Morgan Kaufmann, San Mateo,
CA, 1995, pp. 1636–1642.
[BF97] Avrim L. Blum and Merrick L. Furst. “Fast planning through planning graph analysis”.
In: Artificial Intelligence 90.1-2 (1997), pp. 279–298.
[BG01] Blai Bonet and Héctor Geffner. “Planning as Heuristic Search”. In: Artificial Intelli-
gence 129.1–2 (2001), pp. 5–33.
[BG99] Blai Bonet and Héctor Geffner. “Planning as Heuristic Search: New Results”. In:
Proceedings of the 5th European Conference on Planning (ECP’99). Ed. by S. Biundo
and M. Fox. Springer-Verlag, 1999, pp. 60–72.
[BKS04] Paul Beame, Henry A. Kautz, and Ashish Sabharwal. “Towards Understanding and
Harnessing the Potential of Clause Learning”. In: Journal of Artificial Intelligence
Research 22 (2004), pp. 319–351.
[Bon+12] Blai Bonet et al., eds. Proceedings of the 22nd International Conference on Automated
Planning and Scheduling (ICAPS’12). AAAI Press, 2012.
[Bro90] Rodney Brooks. “Elephants Don’t Play Chess”. In: Robotics and Autonomous Systems 6.1–2 (1990), pp. 3–15. doi:
10.1016/S0921-8890(05)80025-9.
[Byl94] Tom Bylander. “The Computational Complexity of Propositional STRIPS Planning”.
In: Artificial Intelligence 69.1–2 (1994), pp. 165–204.
[Cho65] Noam Chomsky. Syntactic structures. Den Haag: Mouton, 1965.
[CKT91] Peter Cheeseman, Bob Kanefsky, and William M. Taylor. “Where the Really Hard
Problems Are”. In: Proceedings of the 12th International Joint Conference on Artificial
Intelligence (IJCAI). Ed. by John Mylopoulos and Ray Reiter. Sydney, Australia:
Morgan Kaufmann, San Mateo, CA, 1991, pp. 331–337.
[CM85] Eugene Charniak and Drew McDermott. Introduction to Artificial Intelligence. Ad-
dison Wesley, 1985.
[CQ69] Allan M. Collins and M. Ross Quillian. “Retrieval time from semantic memory”. In:
Journal of verbal learning and verbal behavior 8.2 (1969), pp. 240–247. doi: 10.1016/
S0022-5371(69)80069-1.
[Dav67] Donald Davidson. “Truth and Meaning”. In: Synthese 17 (1967).
[DCM12] DCMI Usage Board. DCMI Metadata Terms. DCMI Recommendation. Dublin Core
Metadata Initiative, June 14, 2012. url: http : / / dublincore . org / documents /
2012/06/14/dcmi-terms/.
[DF31] B. De Finetti. “Sul significato soggettivo della probabilita”. In: Fundamenta Mathe-
maticae 17 (1931), pp. 298–329.
[DHK15] Carmel Domshlak, Jörg Hoffmann, and Michael Katz. “Red-Black Planning: A New
Systematic Approach to Partial Delete Relaxation”. In: Artificial Intelligence 221
(2015), pp. 73–114.
[Ede01] Stefan Edelkamp. “Planning with Pattern Databases”. In: Proceedings of the 6th Eu-
ropean Conference on Planning (ECP’01). Ed. by A. Cesta and D. Borrajo. Springer-
Verlag, 2001, pp. 13–24.
[FD14] Zohar Feldman and Carmel Domshlak. “Simple Regret Optimization in Online Plan-
ning for Markov Decision Processes”. In: Journal of Artificial Intelligence Research
51 (2014), pp. 165–205.
[Fis] John R. Fisher. prolog :- tutorial. url: https://www.cpp.edu/~jrfisher/www/
prolog_tutorial/ (visited on 10/10/2019).
[FL03] Maria Fox and Derek Long. “PDDL2.1: An Extension to PDDL for Expressing Tem-
poral Planning Domains”. In: Journal of Artificial Intelligence Research 20 (2003),
pp. 61–124.
[Fla94] Peter Flach. Simply Logical – Intelligent Reasoning by Example. Wiley, 1994. isbn: 0471 94152 2. url: https://github.com/simply-
logical/simply-logical/releases/download/v1.0/SL.pdf.
[FN71] Richard E. Fikes and Nils Nilsson. “STRIPS: A New Approach to the Application of
Theorem Proving to Problem Solving”. In: Artificial Intelligence 2 (1971), pp. 189–
208.
[Gen34] Gerhard Gentzen. “Untersuchungen über das logische Schließen I”. In: Mathematische
Zeitschrift 39.2 (1934), pp. 176–210.
[Ger+09] Alfonso Gerevini et al. “Deterministic planning in the fifth international planning
competition: PDDL3 and experimental evaluation of the planners”. In: Artificial In-
telligence 173.5-6 (2009), pp. 619–668.
[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability—A Guide to
the Theory of NP-Completeness. Freeman, 1979.
[Glo] Grundlagen der Logik in der Informatik. Course notes at https://www8.cs.fau.de/
_media/ws16:gloin:skript.pdf. url: https://www8.cs.fau.de/_media/ws16:
gloin:skript.pdf (visited on 10/13/2017).
[GNT04] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: Theory and
Practice. Morgan Kaufmann, 2004.
[GS05] Carla Gomes and Bart Selman. “Can get satisfaction”. In: Nature 435 (2005), pp. 751–
752.
[GSS03] Alfonso Gerevini, Alessandro Saetti, and Ivan Serina. “Planning through Stochas-
tic Local Search and Temporal Action Graphs”. In: Journal of Artificial Intelligence
Research 20 (2003), pp. 239–290.
[Hau85] John Haugeland. Artificial intelligence: the very idea. Massachusetts Institute of Tech-
nology, 1985.
[HD09] Malte Helmert and Carmel Domshlak. “Landmarks, Critical Paths and Abstractions:
What’s the Difference Anyway?” In: Proceedings of the 19th International Conference
on Automated Planning and Scheduling (ICAPS’09). Ed. by Alfonso Gerevini et al.
AAAI Press, 2009, pp. 162–169.
[HE05] Jörg Hoffmann and Stefan Edelkamp. “The Deterministic Part of IPC-4: An Overview”.
In: Journal of Artificial Intelligence Research 24 (2005), pp. 519–579.
[Hel06] Malte Helmert. “The Fast Downward Planning System”. In: Journal of Artificial In-
telligence Research 26 (2006), pp. 191–246.
[Her+13a] Ivan Herman et al. RDF 1.1 Primer (Second Edition). Rich Structured Data Markup
for Web Documents. W3C Working Group Note. World Wide Web Consortium (W3C),
2013. url: http://www.w3.org/TR/rdfa-primer.
[Her+13b] Ivan Herman et al. RDFa 1.1 Primer – Second Edition. Rich Structured Data Markup
for Web Documents. W3C Working Goup Note. World Wide Web Consortium (W3C),
Apr. 19, 2013. url: http://www.w3.org/TR/xhtml-rdfa-primer/.
[HG00] Patrik Haslum and Hector Geffner. “Admissible Heuristics for Optimal Planning”. In:
Proceedings of the 5th International Conference on Artificial Intelligence Planning
Systems (AIPS’00). Ed. by S. Chien, R. Kambhampati, and C. Knoblock. Brecken-
ridge, CO: AAAI Press, Menlo Park, 2000, pp. 140–149.
[HG08] Malte Helmert and Hector Geffner. “Unifying the Causal Graph and Additive Heuris-
tics”. In: Proceedings of the 18th International Conference on Automated Planning
and Scheduling (ICAPS’08). Ed. by Jussi Rintanen et al. AAAI Press, 2008, pp. 140–
147.
[HHH07] Malte Helmert, Patrik Haslum, and Jörg Hoffmann. “Flexible Abstraction Heuristics
for Optimal Sequential Planning”. In: Proceedings of the 17th International Conference
on Automated Planning and Scheduling (ICAPS’07). Ed. by Mark Boddy, Maria
Fox, and Sylvie Thiebaux. Providence, Rhode Island, USA: Morgan Kaufmann, 2007,
pp. 176–183.
[Hit+12] Pascal Hitzler et al. OWL 2 Web Ontology Language Primer (Second Edition). W3C
Recommendation. World Wide Web Consortium (W3C), 2012. url: http://www.
w3.org/TR/owl-primer.
[HN01] Jörg Hoffmann and Bernhard Nebel. “The FF Planning System: Fast Plan Generation
Through Heuristic Search”. In: Journal of Artificial Intelligence Research 14 (2001),
pp. 253–302.
[Hof05] Jörg Hoffmann. “Where ‘Ignoring Delete Lists’ Works: Local Search Topology in Plan-
ning Benchmarks”. In: Journal of Artificial Intelligence Research 24 (2005), pp. 685–
758.
[Hof11] Jörg Hoffmann. “Everything You Always Wanted to Know about Planning (But
Were Afraid to Ask)”. In: Proceedings of the 34th Annual German Conference on
Artificial Intelligence (KI’11). Ed. by Joscha Bach and Stefan Edelkamp. Vol. 7006.
Lecture Notes in Computer Science. Springer, 2011, pp. 1–13. url: http://fai.cs.
uni-saarland.de/hoffmann/papers/ki11.pdf.
[How60] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
[ILD] 7. Constraints: Interpreting Line Drawings. url: https://www.youtube.com/watch?
v=l-tzjenXrvI&t=2037s (visited on 11/19/2019).
[JN33] J. Neyman and E. S. Pearson. “IX. On the problem of the most efficient tests of statis-
tical hypotheses”. In: Philosophical Transactions of the Royal Society of London A:
Mathematical, Physical and Engineering Sciences 231.694-706 (1933), pp. 289–337.
doi: 10.1098/rsta.1933.0009.
[Kah11] Daniel Kahneman. Thinking, fast and slow. Penguin Books, 2011. isbn: 9780141033570.
[KC04] Graham Klyne and Jeremy J. Carroll. Resource Description Framework (RDF): Con-
cepts and Abstract Syntax. W3C Recommendation. World Wide Web Consortium
(W3C), Feb. 10, 2004. url: http://www.w3.org/TR/2004/REC- rdf- concepts-
20040210/.
[KD09] Erez Karpas and Carmel Domshlak. “Cost-Optimal Planning with Landmarks”. In:
Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJ-
CAI’09). Ed. by C. Boutilier. Pasadena, California, USA: Morgan Kaufmann, July
2009, pp. 1728–1733.
[Met+53] N. Metropolis et al. “Equations of state calculations by fast computing machines”. In:
Journal of Chemical Physics 21 (1953), pp. 1087–1091.
[Min] Minion - Constraint Modelling. System Web page at http://constraintmodelling.
org/minion/. url: http://constraintmodelling.org/minion/.
[MMS93] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. “Building a
large annotated corpus of English: the penn treebank”. In: Computational Linguistics
19.2 (1993), pp. 313–330.
[MR91] John Mylopoulos and Ray Reiter, eds. Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI’91). Sydney, Australia: Morgan Kaufmann, San Mateo, CA, 1991.
[MSL92] David Mitchell, Bart Selman, and Hector J. Levesque. “Hard and Easy Distributions
of SAT Problems”. In: Proceedings of the 10th National Conference of the American
Association for Artificial Intelligence (AAAI’92). San Jose, CA: MIT Press, 1992,
pp. 459–465.
[NHH11] Raz Nissim, Jörg Hoffmann, and Malte Helmert. “Computing Perfect Heuristics in
Polynomial Time: On Bisimulation and Merge-and-Shrink Abstraction in Optimal
Planning”. In: Proceedings of the 22nd International Joint Conference on Artificial
Intelligence (IJCAI’11). Ed. by Toby Walsh. AAAI Press/IJCAI, 2011, pp. 1983–
1990.
[Nor+18a] Emily Nordmann et al. Lecture capture: Practical recommendations for students and
lecturers. 2018. url: https://osf.io/huydx/download.
[Nor+18b] Emily Nordmann et al. Vorlesungsaufzeichnungen nutzen: Eine Anleitung für Studierende.
2018. url: https://osf.io/e6r7a/download.
[NS63] Allen Newell and Herbert Simon. “GPS, a program that simulates human thought”.
In: Computers and Thought. Ed. by E. Feigenbaum and J. Feldman. McGraw-Hill,
1963, pp. 279–293.
[NS76] Alan Newell and Herbert A. Simon. “Computer Science as Empirical Inquiry: Symbols
and Search”. In: Communications of the ACM 19.3 (1976), pp. 113–126. doi: 10.
1145/360018.360022.
[OWL09] OWL Working Group. OWL 2 Web Ontology Language: Document Overview. W3C
Recommendation. World Wide Web Consortium (W3C), Oct. 27, 2009. url: http:
//www.w3.org/TR/2009/REC-owl2-overview-20091027/.
[PD09] Knot Pipatsrisawat and Adnan Darwiche. “On the Power of Clause-Learning SAT
Solvers with Restarts”. In: Proceedings of the 15th International Conference on Princi-
ples and Practice of Constraint Programming (CP’09). Ed. by Ian P. Gent. Vol. 5732.
Lecture Notes in Computer Science. Springer, 2009, pp. 654–668.
[Pól73] George Pólya. How to Solve it. A New Aspect of Mathematical Method. Princeton
University Press, 1973.
[Pra+94] Malcolm Pradhan et al. “Knowledge Engineering for Large Belief Networks”. In:
Proceedings of the Tenth International Conference on Uncertainty in Artificial In-
telligence. UAI’94. Seattle, WA: Morgan Kaufmann Publishers Inc., 1994, pp. 484–
490. isbn: 1-55860-332-8. url: http://dl.acm.org/citation.cfm?id=2074394.
2074456.
[Pro] Protégé. Project Home page at http://protege.stanford.edu. url: http://protege.stanford.edu.
[PRR97] G. Probst, St. Raub, and Kai Romhardt. Wissen managen. Gabler Verlag, 1997 (4th ed. 2003).
[PS08] Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. W3C
Recommendation. World Wide Web Consortium (W3C), Jan. 15, 2008. url: http:
//www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/.
[PW92] J. Scott Penberthy and Daniel S. Weld. “UCPOP: A Sound, Complete, Partial Order
Planner for ADL”. In: Principles of Knowledge Representation and Reasoning: Pro-
ceedings of the 3rd International Conference (KR-92). Ed. by B. Nebel, W. Swartout,
and C. Rich. Cambridge, MA: Morgan Kaufmann, Oct. 1992, pp. 103–114. url: ftp:
//ftp.cs.washington.edu/pub/ai/ucpop-kr92.ps.Z.
[RHN06] Jussi Rintanen, Keijo Heljanko, and Ilkka Niemelä. “Planning as satisfiability: parallel
plans and algorithms for plan search”. In: Artificial Intelligence 170.12-13 (2006),
pp. 1031–1080.
[Rin10] Jussi Rintanen. “Heuristics for Planning with SAT”. In: Proceedings of the 16th In-
ternational Conference on Principles and Practice of Constraint Programming. 2010,
pp. 414–428.
[RN03] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2nd ed.
Pearson Education, 2003. isbn: 0137903952.
[RN09] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd.
Prentice Hall Press, 2009. isbn: 0136042597, 9780136042594.
[RN95] Stuart J. Russell and Peter Norvig. Artificial Intelligence — A Modern Approach.
Upper Saddle River, NJ: Prentice Hall, 1995.
[RW10] Silvia Richter and Matthias Westphal. “The LAMA Planner: Guiding Cost-Based
Anytime Planning with Landmarks”. In: Journal of Artificial Intelligence Research
39 (2010), pp. 127–177.
[RW91] S. J. Russell and E. Wefald. Do the Right Thing — Studies in limited Rationality.
MIT Press, 1991.
[Sea69] John R. Searle. Speech Acts: An Essay in the Philosophy of Language. Cambridge,
London: Cambridge University Press, 1969.
[Sil+16] David Silver et al. “Mastering the Game of Go with Deep Neural Networks and Tree
Search”. In: Nature 529 (2016), pp. 484–503. url: http://www.nature.com/nature/
journal/v529/n7587/full/nature16961.html.
[Smu63] Raymond M. Smullyan. “A Unifying Principle for Quantification Theory”. In: Proc.
Nat. Acad Sciences 49 (1963), pp. 828–832.
[SR14] Guus Schreiber and Yves Raimond. RDF 1.1 Primer. W3C Working Group Note.
World Wide Web Consortium (W3C), 2014. url: http://www.w3.org/TR/rdf-
primer.
[SR91] C. Samuelsson and M. Rayner. “Quantitative evaluation of explanation-based learning
as an optimization tool for a large-scale natural language system”. In: Proceedings of
the 12th International Joint Conference on Artificial Intelligence (IJCAI). Ed. by
John Mylopoulos and Ray Reiter. Sydney, Australia: Morgan Kaufmann, San Mateo,
CA, 1991, pp. 609–615.
[sTeX] sTeX: A semantic Extension of TeX/LaTeX. url: https://github.com/sLaTeX/
sTeX (visited on 05/11/2020).
[SWI] SWI Prolog Reference Manual. url: https://www.swi-prolog.org/pldoc/refman/
(visited on 10/10/2019).
[Tur50] Alan Turing. “Computing Machinery and Intelligence”. In: Mind 59 (1950), pp. 433–
460.
[Wal75] David Waltz. “Understanding Line Drawings of Scenes with Shadows”. In: The Psy-
chology of Computer Vision. Ed. by P. H. Winston. McGraw-Hill, 1975, pp. 1–19.
[WHI] Human intelligence — Wikipedia The Free Encyclopedia. url: https://en.wikipedia.
org/w/index.php?title=Human_intelligence (visited on 04/09/2018).
[Wit53] Ludwig Wittgenstein. Philosophical Investigations. Basil Blackwell, 1953. isbn: 0631119000.
Part VIII
Excursions
As this course is predominantly an overview of the topics of Artificial Intelligence, and not
about their theoretical underpinnings, we collect the discussion of the latter in this “suggested
readings” part.
Appendix A
Completeness of Calculi for Propositional Logic
The next step is to analyze the two calculi for completeness. For that we will first give ourselves
a very powerful tool: the “model existence theorem” (??), which encapsulates the model-theoretic
part of completeness theorems. With that, completeness proofs – which are quite tedious otherwise
– become a breeze.
Corollary: With the model existence theorem in hand, the completeness of C follows.
The proof of the model existence theorem goes via the notion of a Hintikka set, a set of
formulae with very strong syntactic closure properties, which allow us to read off models. Jaakko
Hintikka’s original idea for completeness proofs was that for every complete calculus C and every
C-consistent set one can induce a Hintikka set, from which a model can be constructed. This can
be considered as a first model existence theorem. However, the process of obtaining a Hintikka set
for a C-consistent set Φ of sentences usually involves complicated calculus dependent constructions.
In this situation, Raymond Smullyan was able to formulate the sufficient conditions for the
existence of Hintikka sets in the form of “abstract consistency properties” by isolating the calculus
independent parts of the Hintikka set construction. His technique allows us to reformulate Hintikka
sets as maximal elements of abstract consistency classes and interpret the Hintikka set construction
as a maximizing limit process.
To carry out the “model-existence”/“abstract consistency” method, we will first have to look at
the notion of consistency.
Consistency and refutability are very important notions when studying the completeness for
calculi; they form syntactic counterparts of satisfiability.
Consistency
Let C be a calculus,. . .
Definition A.1.1. Let C be a calculus, then a formula set Φ is called C-refutable, if we can
derive a contradiction from it.
Definition A.1.2. We call a pair of formulae A and ¬A a contradiction.
So a set Φ is C-refutable, if C can derive a contradiction from it.
Definition A.1.3. Let C be a calculus, then a formula set Φ is called C-consistent, iff there
is a formula B that is not derivable from Φ in C.
Definition A.1.4. We call a calculus C reasonable, iff implication elimination and
conjunction introduction are admissible in C and A ∧ ¬A ⇒ B is a C-theorem.
Theorem A.1.5. C-inconsistency and C-refutability coincide for reasonable calculi.
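For instance, the direction from C-refutability to C-inconsistency can be spelled out directly from
the definition of a reasonable calculus: if Φ⊢C A and Φ⊢C ¬A for some A, then conjunction
introduction gives Φ⊢C A ∧ ¬A, and since A ∧ ¬A ⇒ B is a C-theorem, implication elimination
yields Φ⊢C B for every formula B, i.e. Φ is C-inconsistent. The converse direction is immediate:
if every formula is derivable from Φ, then so are A and ¬A for any A.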
It is very important to distinguish the syntactic C-refutability and C-consistency from satisfiability,
which is a property of formulae that is at the heart of semantics. Note that the former have the
calculus (a syntactic device) as a parameter, while the latter does not. In fact we should actually
say S-satisfiability, where S := ⟨L, K, |=⟩ is the current logical system.
Even the word “contradiction” has a syntactic flavor to it; it translates to “saying against
each other” from its Latin root.
Abstract Consistency
Definition A.1.6. Let ∇ be a family of sets. We call ∇ closed under subsets, iff
for each Φ∈∇, all subsets Ψ ⊆ Φ are elements of ∇.
Notation: We will use Φ∗A for Φ ∪ {A}.
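Spelled out (here only schematically, since the precise formulation depends on the chosen set of
connectives), the conditions require that ∇ is closed under subsets and that for all Φ∈∇:
∇c ) P ∉Φ or ¬P ∉Φ for atomic P ,
∇¬ ) if ¬¬A∈Φ, then Φ∗A∈∇,
together with two closure conditions for the binary connectives, e.g. if A ∨ B∈Φ, then Φ∗A∈∇ or
Φ∗B∈∇, and if ¬(A ∨ B)∈Φ, then Φ ∪ {¬A, ¬B}∈∇.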
So a family of sets (we call it a family, so that we do not have to say “set of sets” and we can
distinguish the levels) is an abstract consistency class, iff it fulfills five simple conditions, of which
the last three are closure conditions.
Think of an abstract consistency class as a family of “consistent” sets (e.g. C-consistent for some
calculus C); then the properties make perfect sense. Such sets are naturally closed under subsets: if
we cannot derive a contradiction from a large set, we certainly cannot from a subset. Furthermore,
∇c ) if both P ∈Φ and ¬P ∈Φ, then Φ cannot be “consistent”, and
∇¬ ) if we cannot derive a contradiction from Φ with ¬¬A∈Φ, then we cannot from Φ∗A either,
since the two are logically equivalent.
The other two conditions are motivated similarly. We will carry out the proof here, since it
gives us practice in dealing with the abstract consistency properties.
The main result here is that abstract consistency classes can be extended to compact ones. The
proof is quite tedious, but relatively straightforward. It allows us to assume that all abstract
consistency classes are compact in the first place (otherwise we pass to the compact extension).
Actually we are after abstract consistency classes that have an even stronger property than just
being closed under subsets. This will allow us to carry out a limit construction in the Hintikka
set extension argument later.
Compact Collections
Definition A.1.11. We call a collection ∇ of sets compact, iff for any set Φ we
have
Φ∈∇, iff Ψ∈∇ for every finite subset Ψ of Φ.
Lemma A.1.12. If ∇ is compact, then ∇ is closed under subsets.
Proof:
1. Suppose S ⊆ T and T ∈∇.
2. Every finite subset A of S is a finite subset of T .
3. As ∇ is compact, we know that A∈∇.
700 APPENDIX A. COMPLETENESS OF CALCULI FOR PROPOSITIONAL LOGIC
4. Thus S∈∇.
The property of being closed under subsets is a “downwards-oriented” property: we go from large
sets to small sets. Compactness (in the interesting direction) is an “upwards-oriented”
property: we can go from small (finite) sets to large (infinite) sets. The main application for the
compactness condition will be to show that infinite sets of formulae are in a family ∇ by testing
all their finite subsets (which is much simpler).
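As a concrete example, consider the family of all satisfiable sets of propositional formulae: by the
compactness theorem of propositional logic a set Φ is satisfiable iff every finite subset of Φ is, so
this family is compact in the sense of the definition above.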
We now show the result announced above: any abstract consistency class ∇ can be extended to a
compact abstract consistency class ∇′ with ∇ ⊆ ∇′ .
Proof:
1. We choose ∇′ :={Φ ⊆ wff0 (V0 )|every finite subset of Φ is in ∇}.
2. Now suppose that Φ∈∇. ∇ is closed under subsets, so every finite subset of
Φ is in ∇ and thus Φ∈∇′ . Hence ∇ ⊆ ∇′ .
3. Next let us show that ∇′ is compact.
3.1. Suppose Φ∈∇′ and Ψ is an arbitrary finite subset of Φ.
3.2. By definition of ∇′ all finite subsets of Φ are in ∇; since every finite subset of Ψ is also a finite subset of Φ, we get Ψ∈∇′ .
3.3. Thus all finite subsets of Φ are in ∇′ whenever Φ is in ∇′ .
3.4. On the other hand, suppose all finite subsets of Φ are in ∇′ .
3.5. Then by the definition of ∇′ the finite subsets of Φ are also in ∇, so Φ∈∇′ .
Thus ∇′ is compact.
4. Note that ∇′ is closed under subsets by the Lemma above.
5. Now we show that if ∇ satisfies the abstract consistency conditions ∇∗ , then ∇′ satisfies them as well.
5.1. To show ∇c , let Φ∈∇′ and suppose there is an atom A, such that {A, ¬A} ⊆
Φ. Then {A, ¬A}∈∇ contradicting ∇c .
5.2. To show ∇¬ , let Φ∈∇′ and ¬¬A∈Φ, then Φ∗A∈∇′ .
5.2.1. Let Ψ be any finite subset of Φ∗A, and Θ:=(Ψ\{A})∗¬¬A.
5.2.2. Θ is a finite subset of Φ, so Θ∈∇.
5.2.3. Since ∇ is an abstract consistency class and ¬¬A∈Θ, we get Θ∗A∈∇
by ∇¬ .
5.2.4. We know that Ψ ⊆ Θ∗A and ∇ is closed under subsets, so Ψ∈∇.
5.2.5. Thus every finite subset Ψ of Φ∗A is in ∇ and therefore by definition
Φ∗A∈∇′ .
5.3. The other cases are analogous to ∇¬ .
Hintikka sets are sets of sentences with very strong analytic closure conditions. These are motivated
as maximally consistent sets, i.e. sets that already contain everything that can be consistently
added to them.
∇-Hintikka Set
Definition A.1.14. Let ∇ be an abstract consistency class, then we call a set
H∈∇ a ∇-Hintikka set, iff H is maximal in ∇, i.e. for all A with H∗A∈∇ we
already have A∈H.
∇-Hintikka Set
Theorem A.1.15 (Hintikka Properties). Let ∇ be an abstract consistency class and H a
∇-Hintikka set, then
Hc ) for no formula A we have both A∈H and ¬A∈H, and
H¬ ) if ¬¬A∈H, then A∈H
(together with analogous closure properties for the other connectives).
Proof:
We prove the properties in turn
1. Hc by induction on the structure of A
1.1. A∈V0 : Then A∉H or ¬A∉H by ∇c .
1.2. A = ¬B
1.2.1. Let us assume that ¬B∈H and ¬¬B∈H,
1.2.2. then H∗B∈∇ by ∇¬ , and therefore B∈H by maximality.
1.2.3. So both B and ¬B are in H, which contradicts the inductive hy-
pothesis.
1.3. A = B ∨ C: similar to the previous case
2. We prove H¬ by maximality of H in ∇.
2.1. If ¬¬A∈H, then H∗A∈∇ by ∇¬ .
2.2. The maximality of H now gives us that A∈H.
Proof sketch: The other Hintikka properties are shown analogously.
The following theorem is one of the main results in the “abstract consistency”/“model existence”
method. For any set Φ in an abstract consistency class ∇ it allows us to construct a ∇-Hintikka set H with Φ ⊆ H.
Extension Theorem
Theorem A.1.16. If ∇ is an abstract consistency class and Φ∈∇, then there is a
∇-Hintikka set H with Φ ⊆ H.
Proof:
1. Wlog. we assume that ∇ is compact (otherwise pass to compact extension)
2. We choose an enumeration A1 , . . . of the set wff0 (V0 )
3. and construct a sequence of sets Hi with H0 :=Φ and
Hn+1 := Hn , if Hn ∗An ∉ ∇, and Hn+1 := Hn ∗An , if Hn ∗An ∈ ∇.
4. Note that all Hi ∈∇; choose H := ⋃i∈N Hi .
5. Ψ ⊆ H finite implies there is a j∈N such that Ψ ⊆ Hj ,
6. so Ψ∈∇ as ∇ closed under subsets and H∈∇ as ∇ is compact.
7. Let H∗B∈∇, then there is a j∈N with B = Aj . Since Hj ∗Aj ⊆ H∗B and ∇ is closed
under subsets, we have Hj ∗Aj ∈∇, so B∈Hj+1 ⊆ H.
8. Thus H is ∇-maximal
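To illustrate the construction on a toy example, take ∇ to be the family of satisfiable sets of
propositional formulae (which is easily checked to be an abstract consistency class), Φ:={P ∨ Q},
and a hypothetical enumeration that begins with ¬P , ¬Q, Q, . . . Then H1 = {P ∨ Q, ¬P } (satisfiable,
e.g. by P ↦ F and Q ↦ T), H2 = H1 (since H1 ∗¬Q is unsatisfiable), H3 = {P ∨ Q, ¬P , Q}, and so
on; in the limit, H contains for every formula either it or its negation, namely whichever is true
under the assignment determined by these choices.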
Note that the construction in the proof above is non-trivial in two respects. First, the limit
construction for H is not executed in our original abstract consistency class ∇, but in a suitably
extended one to make it compact — the original would not have contained H in general. Second,
the set H is not unique for Φ, but depends on the choice of the enumeration of wff0 (V0 ). If we pick a
different enumeration, we will end up with a different H. Say if A and ¬A are both ∇-consistent
with Φ, then depending on which one comes first in the enumeration, H will contain that one, with
all the consequences for subsequent choices in the construction process.
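For instance, with ∇ and Φ as in the illustration above, an enumeration that begins with P instead
of ¬P yields a Hintikka set that contains P (and hence, by Hc , not ¬P ), so the two enumerations
lead to incompatible sets H.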
Valuation
Definition A.1.17. A function ν : wff0 (V0 )→Do is called a valuation, iff
ν(¬A) = T, iff ν(A) = F
ν(A ∧ B) = T, iff ν(A) = T and ν(B) = T
Lemma A.1.18. If ν : wff0 (V0 )→Do is a valuation and Φ ⊆ wff0 (V0 ) with ν(Φ) =
{T}, then Φ is satisfiable.
Proof sketch: ν|V0 : V0 →Do is a satisfying variable assignment.
Now, we only have to put the pieces together to obtain the model existence theorem we are after.
Model Existence
Lemma A.1.20 (Hintikka-Lemma). If ∇ is an abstract consistency class and H
a ∇-Hintikka set, then H is satisfiable.
Proof:
1. We define ν(A):=T, iff A∈H
2. then ν is a valuation by the Hintikka properties
3. and thus ν|V0 is a satisfying assignment.
Theorem A.1.21 (Model Existence). If ∇ is an abstract consistency class and
Φ∈∇, then Φ is satisfiable.
Proof:
1. There is a ∇-Hintikka set H with Φ ⊆ H (Extension Theorem)
2. We know that H is satisfiable. (Hintikka-Lemma)
3. In particular, Φ ⊆ H is satisfiable.
Observation: If we look at the completeness proof below, we see that the lemma above is the
only place where we had to deal with specific properties of the calculus T0 .
So if we want to prove completeness of any other calculus with respect to propositional logic,
then we only need to prove an analogue of this lemma and can use the rest of the machinery we
have already established “off the shelf”.
This is one great advantage of the “abstract consistency method”; the other is that the method
can be extended transparently to other logics.
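In outline, such a completeness proof for a refutation calculus C (like T0 ) runs as follows: one shows
that ∇:={Φ | Φ has no closed C-refutation} is an abstract consistency class; this is the only step
that needs properties of C. If Φ is unsatisfiable, the model existence theorem then gives Φ∉∇ by
contraposition, i.e. Φ has a closed C-refutation, so C is refutation complete.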
Completeness of T0
We will now analyze the first-order calculi for completeness. Just as in the case of the propositional
calculi, we prove a model existence theorem for the first-order model theory and then use that
for the completeness proofs. The proof of the first-order model existence theorem is completely
analogous to the propositional one; indeed, apart from the model construction itself, it is just an
extension by a treatment for the first-order quantifiers.
Appendix B
Completeness of Calculi for First-Order Logic
The proof of the model existence theorem goes via the notion of a Hintikka set, a set of
formulae with very strong syntactic closure properties, which allow us to read off models. Jaakko
Hintikka’s original idea for completeness proofs was that for every complete calculus C and every
C-consistent set one can induce a Hintikka set, from which a model can be constructed. This can
be considered as a first model existence theorem. However, the process of obtaining a Hintikka set
for a C-consistent set Φ of sentences usually involves complicated calculus dependent constructions.
In this situation, Raymond Smullyan was able to formulate the sufficient conditions for the
existence of Hintikka sets in the form of “abstract consistency properties” by isolating the calculus
independent parts of the Hintikka set construction. His technique allows us to reformulate Hintikka
sets as maximal elements of abstract consistency classes and interpret the Hintikka set construction
as a maximizing limit process.
To carry out the “model-existence”/“abstract consistency” method, we will first have to look at
the notion of consistency.
Consistency and refutability are very important notions when studying the completeness for
calculi; they form syntactic counterparts of satisfiability.
Consistency
Let C be a calculus,. . .
Definition B.1.1. Let C be a calculus, then a formula set Φ is called C-refutable, if we can
derive a contradiction from it.
Definition B.1.2. We call a pair of formulae A and ¬A a contradiction.
It is very important to distinguish the syntactic C-refutability and C-consistency from satisfiability,
which is a property of formulae that is at the heart of semantics. Note that the former have the
calculus (a syntactic device) as a parameter, while the latter does not. In fact we should actually
say S-satisfiability, where S := ⟨L, K, |=⟩ is the current logical system.
Even the word “contradiction” has a syntactic flavor to it; it translates to “saying against
each other” from its Latin root.
The notion of an “abstract consistency class” provides a calculus-independent notion of con-
sistency: A set Φ of sentences is considered “consistent in an abstract sense”, iff it is a member of
an abstract consistency class ∇.
Abstract Consistency
Definition B.1.6. Let ∇ be a family of sets. We call ∇ closed under subsets, iff
for each Φ∈∇, all subsets Ψ ⊆ Φ are elements of ∇.
Notation: We will use Φ∗A for Φ ∪ {A}.
The conditions are very natural: take for instance ∇c ; it would be foolish to call a set Φ of
sentences “consistent under a complete calculus”, if it contains an elementary contradiction. The
next condition ∇¬ says that if a set Φ that contains a sentence ¬¬A is “consistent”, then we should
be able to extend it by A without losing this property; in other words, a complete calculus should
be able to recognize A and ¬¬A to be equivalent. We will carry out the proof here, since it
gives us practice in dealing with the abstract consistency properties.
The main result here is that abstract consistency classes can be extended to compact ones. The
proof is quite tedious, but relatively straightforward. It allows us to assume that all abstract
consistency classes are compact in the first place (otherwise we pass to the compact extension).
Actually we are after abstract consistency classes that have an even stronger property than just
being closed under subsets. This will allow us to carry out a limit construction in the Hintikka
set extension argument later.
Compact Collections
Definition B.1.8. We call a collection ∇ of sets compact, iff for any set Φ we
have
Φ∈∇, iff Ψ∈∇ for every finite subset Ψ of Φ.
The property of being closed under subsets is a “downwards-oriented” property: we go from large
sets to small sets. Compactness (in the interesting direction) is an “upwards-oriented”
property: we can go from small (finite) sets to large (infinite) sets. The main application for the
compactness condition will be to show that infinite sets of formulae are in a family ∇ by testing
all their finite subsets (which is much simpler).
Hintikka sets are sets of sentences with very strong analytic closure conditions. These are motivated
as maximally consistent sets, i.e. sets that already contain everything that can be consistently
added to them.
∇-Hintikka Set
Definition B.1.11. Let ∇ be an abstract consistency class, then we call a set
H∈∇ a ∇-Hintikka set, iff H is maximal in ∇, i.e. for all A with H∗A∈∇ we
already have A∈H.
Theorem B.1.12 (Hintikka Properties). Let ∇ be an abstract consistency class
and H be a ∇-Hintikka set, then H cannot contain both a formula and its negation, if ¬¬A∈H,
then A∈H, and H satisfies analogous closure properties for the other connectives and the quantifiers.
The following theorem is one of the main results in the “abstract consistency”/”model existence”
method. For any set Φ in an abstract consistency class ∇ it allows us to construct a ∇-Hintikka set H with Φ ⊆ H.
Extension Theorem
Theorem B.1.13. If ∇ is an abstract consistency class and Φ∈∇ finite, then there
is a ∇-Hintikka set H with Φ ⊆ H.
Proof:
1. Wlog. we assume that ∇ is compact (otherwise we pass to the compact extension).
2. Choose an enumeration A1 , . . . of cwffo (Σι ) and an enumeration c1 , . . . of Σsk0 .
3. and construct a sequence of sets Hi with H0 :=Φ and
Hn+1 := Hn , if Hn ∗An ∉ ∇,
Hn+1 := Hn ∪ {An , ¬([cn /X](B))}, if Hn ∗An ∈ ∇ and An = ¬(∀X B),
Hn+1 := Hn ∗An otherwise.
4. Note that all Hi ∈∇; choose H := ⋃i∈N Hi .
5. Ψ ⊆ H finite implies there is a j∈N such that Ψ ⊆ Hj ,
6. so Ψ∈∇ as ∇ closed under subsets and H∈∇ as ∇ is compact.
7. Let H∗B∈∇, then there is a j∈N with B = Aj . Since Hj ∗Aj ⊆ H∗B and ∇ is closed
under subsets, we have Hj ∗Aj ∈∇, so B∈Hj+1 ⊆ H.
8. Thus H is ∇-maximal
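For illustration: if, say, An = ¬(∀X p(X)) and Hn ∗An ∈∇, then the construction adds both
¬(∀X p(X)) and ¬(p(cn )) to Hn+1 , where cn is the next unused witness constant from Σsk0 ; in this
way H eventually contains a concrete counterexample instance for every universal formula it refutes.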
Note that the construction in the proof above is non-trivial in two respects. First, the limit
construction for H is not executed in our original abstract consistency class ∇, but in a suitably
extended one to make it compact — the original would not have contained H in general. Second,
the set H is not unique for Φ, but depends on the choice of the enumeration of cwff o (Σι ). If
we pick a different enumeration, we will end up with a different H. Say if A and ¬A are both
∇-consistent with Φ, then depending on which one comes first in the enumeration, H will contain
that one, with all the consequences for subsequent choices in the construction process.
Valuations
Definition B.1.14. A function µ : cwff o (Σι )→Do is called a (first-order) valuation,
iff
µ(¬A) = T, iff µ(A) = F
µ(A ∧ B) = T, iff µ(A) = T and µ(B) = T
µ(∀X A) = T, iff µ([B/X](A)) = T for all closed terms B.
Lemma B.1.15. If φ : Vι →D is a variable assignment, then I φ : cwff o (Σι )→Do
is a valuation.
Thus a valuation is a weaker notion of evaluation in first-order logic; the other direction is also
true, even though the proof of this result is much more involved: The existence of a first-order
valuation that makes a set of sentences true entails the existence of a model that satisfies it.
3.6. A = ∀X B
3.6.1. then I φ (A) = T, iff I ψ (B) = µ(ψ(B)) = T, for all C∈Dι , where
ψ = φ,[C/X]. This is the case, iff µ(φ(A)) = T.
4. Thus I φ (A) = µ(φ(A)) = µ(A) = T for all A∈Φ.
5. Hence M|=A for M:=⟨Dι , I⟩.
Now, we only have to put the pieces together to obtain the model existence theorem we are after.
Model Existence
Theorem B.1.17 (Hintikka-Lemma). If ∇ is an abstract consistency class and
H a ∇-Hintikka set, then H is satisfiable.
Proof:
1. we define µ(A):=T, iff A∈H,
2. then µ is a valuation by the Hintikka set properties.
3. We have µ(H) = {T}, so H is satisfiable.
Theorem B.1.18 (Model Existence). If ∇ is an abstract consistency class and
Φ∈∇, then Φ is satisfiable.
Proof:
1. There is a ∇-Hintikka set H with Φ ⊆ H (Extension Theorem)
2. We know that H is satisfiable. (Hintikka-Lemma)
3. In particular, Φ ⊆ H is satisfiable.
This directly yields two important results that we will use for the completeness analysis.
Henkin’s Theorem
Corollary B.2.2 (Henkin’s Theorem). Every ND1 -consistent set of sentences
has a model.
Proof:
1. Let Φ be a ND1 -consistent set of sentences.
2. The class of ND1 -consistent sets of sentences constitutes an abstract consis-
tency class.
3. Thus the model existence theorem guarantees a model for Φ.
Corollary B.2.3 (Löwenheim-Skolem Theorem). Every satisfiable set Φ of first-order
sentences has a countable model.
Proof sketch: The model we constructed is countable, since the set of ground terms
is.
Now, the completeness result for first-order natural deduction is just a simple argument away.
We also get a compactness theorem (almost) for free: logical systems with a complete calculus are
always compact.
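The argument for the latter is, in outline: suppose every finite subset of Φ is satisfiable. Then no
finite subset of Φ is ND1 -refutable (by soundness), and since any ND1 -derivation uses only finitely
many premises from Φ, the set Φ itself is ND1 -consistent; by Henkin’s theorem it therefore has a
model, so Φ is satisfiable.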
Soundness of T1f
Lemma B.3.1. Tableau rules transform satisfiable tableaux into satisfiable ones.
Proof:
We examine the tableau rules in turn:
1. propositional rules as in propositional tableaux
2. T1f ∃ by ??
3. T1f⊥ by ?? (substitution value lemma)
4. T1f ∀
4.1. I φ (∀X A) = T, iff I φ,[a/X] (A) = T for all a∈Dι
4.2. so in particular for some a∈Dι (as Dι is nonempty).
Corollary B.3.2. T1f is correct.
The only interesting steps are the cut rule, which can be directly handled by the substitution
value lemma, and the rule for the existential quantifier, which we do in a separate lemma.
Soundness of T1f ∃
This proof is paradigmatic for soundness proofs for calculi with Skolemization. We use the axiom
of choice at the meta-level to choose a meaning for the Skolem function symbol.
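In outline (a sketch, the details depend on the precise formulation of the rule): suppose the current
branch is satisfiable in a model M, so I φ (∀X A) = F under a satisfying assignment φ. For each
choice of values for the free variables X 1 , . . ., X k of A there is then some a with I φ,[a/X] (A) = F;
using the axiom of choice we pick one such witness for every tuple of values and interpret the new
Skolem function symbol f by the resulting function. The model extended in this way still satisfies
the branch and additionally ([f (X 1 , . . ., X k )/X](A))F , so the rule preserves satisfiability.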
Armed with the Model Existence Theorem for first-order logic (Theorem B.1.18), the com-
pleteness of first-order tableaux is similarly straightforward. We just have to show that the col-
lection of tableau-irrefutable sentences is an abstract consistency class, which is a simple proof-
transformation exercise in all but the universal quantifier case, which we postpone to its own
Lemma (Theorem B.3.5).
Completeness of T1f
[Two tableaux side by side, both starting with ΨT and (∀X A)F : the left one continues with
([c/X](A))F and Rest; the right one continues with ([f (X 1 , . . ., X k )/X](A))F and
[f (X 1 , . . ., X k )/c](Rest), i.e. the new constant c is replaced by the Skolem term
f (X 1 , . . ., X k ) throughout.]
So we only have to treat the case for the universal quantifier. This is what we usually call a
“lifting argument”, since we have to transform (“lift”) a proof for a formula θ(A) to one for A. In
the case of tableaux we do that by an induction on the tableau refutation for θ(A) which creates
a tableau-isomorphism to a tableau refutation for A.
Tableau-Lifting
Theorem B.3.5. If Tθ is a closed tableau for a set θ(Φ) of formulae, then there is
a closed tableau T for Φ.
Again, the “lifting lemma for tableaux” is paradigmatic for lifting lemmata for other refutation
calculi.
Correctness (CNF)
Lemma B.4.1. A set Φ of sentences is satisfiable, iff CNF1 (Φ) is.
Proof: The propositional rules and the ∀-rule are trivial; we only treat the ∃-rule.
1. Let (∀X A)F be satisfiable in M:=⟨D, I⟩ and free(A) = {X 1 , . . ., X n }.
2. I φ (∀X A) = F, so there is an a∈D with I φ,[a/X] (A) = F (this only depends on
φ|free(A) ).
3. Let g : Dn →D be defined by g(a1 , . . ., an ):=a, iff φ(X i ) = ai .
4. Choose M′ :=⟨D, I ′ ⟩ with I ′ (f ):=g, then I ′ φ ([f (X 1 , . . ., X n )/X](A)) = F.
5. Thus ([f (X 1 , . . ., X n )/X](A))F is satisfiable in M′ .
Resolution (Correctness)
Definition B.4.2. A clause is called satisfiable, iff I φ (A) = α for one of its literals
Aα .
Lemma B.4.3. The empty clause □ is unsatisfiable.
Lemma B.4.4. CNF transformations preserve satisfiability (see above)
Completeness (R1 )
Theorem B.4.6. R1 is refutation complete.
Proof: ∇:={Φ | ΦT has no closed tableau} is an abstract consistency class
1. as for the propositional case.
2. by the lifting lemma below
3. Let T be a closed tableau for ¬(∀X A)∈Φ and ΦT ∗([c/X](A))F ∈∇.
4. CNF1 (ΦT ) = CNF1 (ΨT ) ∪ CNF1 (([f (X 1 , . . ., X k )/X](A))F )
5. ([f (X 1 , . . ., X k )/c](CNF1 (ΦT )))∗([c/X](A))F = CNF1 (ΦT )
6. so R1 : CNF1 (ΦT )⊢D′ □, where D′ = [f (X1′ , . . ., Xk′ )/c](D).
Definition B.4.8. Let Φ and Ψ be clause sets, then we call a bijection Ω : Φ→Ψ
a clause set isomorphism, iff there is a clause isomorphism ω : C→Ω(C) for each
C∈Φ.
Lemma B.4.9. If θ(Φ) is a set of formulae, then there is a θ-compatible clause set
isomorphism Ω : CNF1 (Φ)→CNF1 (θ(Φ)).
Lifting for R1
Theorem B.4.10. If R1 : θ(Φ)⊢Dθ □ for a set θ(Φ) of formulae, then there is an
R1 -refutation for Φ.
Proof: by induction over Dθ we construct an R1 -derivation R1 : Φ⊢D C and a θ-
compatible clause set isomorphism Ω : D→Dθ
1. If Dθ ends in a resolution step res with premises (θ(A))T ∨ (θ(C)) (derived by Dθ′ ) and
(θ(B))F ∨ (θ(D)) (derived by Dθ′′ ) and resolvent (σ(θ(C))) ∨ (σ(θ(D))),
then we have (IH) clause isomorphisms ω ′ : AT ∨ C→(θ(A))T ∨ (θ(C)) and
ω ′′ : BF ∨ D→(θ(B))F ∨ (θ(D)),
2. thus we can resolve AT ∨ C and BF ∨ D with Res to (ρ(C)) ∨ (ρ(D)), where ρ = mgu(A, B)
(which exists, as σ ◦ θ is a unifier of A and B).