0.1 Preface
0.1.1 Course Concept
Objective:
The course aims at giving students a solid (and often somewhat theoretically oriented) foundation in the basic concepts and practices of artificial intelligence. The course will predominantly cover symbolic AI – also sometimes called “good old-fashioned AI” (GOFAI) – in the first semester and offers the very foundations of statistical approaches in the second. Indeed, a full account of subsymbolic, machine-learning-based AI deserves its own specialization courses and needs much more mathematical prerequisites than we can assume in this course.
Context: The course “Artificial Intelligence” (AI 1 & 2) at FAU Erlangen is a two-semester
course in the “Wahlpflichtbereich” (specialization phase) in semesters 5/6 of the Bachelor program
“Computer Science” at FAU Erlangen. It is also available as a (somewhat remedial) course in the
“Vertiefungsmodul Künstliche Intelligenz” in the Computer Science Master’s program.
Prerequisites: AI-1 & 2 builds on the mandatory courses in the FAU Bachelor’s program, in particular the course “Grundlagen der Logik in der Informatik” [Glo], which already covers a lot of the materials usually presented in the “knowledge and reasoning” part of an introductory AI course. The AI-1 & 2 course also minimizes overlap with that course.
The course is relatively elementary; we expect that any student who attended the mandatory CS courses at FAU Erlangen can follow it.
Open to external students:
Other Bachelor programs are increasingly co-opting the course as a specialization option. There is no inherent restriction to computer science students in this course. Students with other study biographies – e.g. students from other Bachelor programs or external Master’s students – should be able to pick up the prerequisites when needed.
0.1.4 Acknowledgments
Materials: Most of the materials in this course are based on Russell/Norvig’s book “Artificial Intelligence — A Modern Approach” (AIMA [RN95]). Even the slides are based on a LaTeX-based slide set, but heavily edited. The section on search algorithms is based on materials obtained from Bernhard Beckert (then Uni Koblenz), which are in turn based on AIMA. Some extensions have been inspired by an AI course by Jörg Hoffmann and Wolfgang Wahlster at Saarland University in 2016. Finally, Dennis Müller suggested and supplied some extensions on AGI. Florian Rabe, Max Rapp and Katja Berčič have carefully re-read the text and pointed out problems.
All course materials have been restructured and semantically annotated in the sTeX format, so that we can base additional semantic services on them.
AI Students: The following students have submitted corrections and suggestions to this and
earlier versions of the notes: Rares Ambrus, Ioan Sucan, Yashodan Nevatia, Dennis Müller, Si-
mon Rainer, Demian Vöhringer, Lorenz Gorse, Philipp Reger, Benedikt Lorch, Maximilian Lösch,
Luca Reeb, Marius Frinken, Peter Eichinger, Oskar Herrmann, Daniel Höfer, Stephan Mattejat,
Matthias Sonntag, Jan Urfei, Tanja Würsching, Adrian Kretschmer, Tobias Schmidt, Maxim On-
ciul, Armin Roth, Liam Corona, Tobias Völk, Lena Voigt, Yinan Shao, Michael Girstl, Matthias
Vietz, Anatoliy Cherepantsev, Stefan Musevski, Matthias Lobenhofer, Philipp Kaludercic, Di-
warkara Reddy, Martin Helmke, Stefan Müller, Dominik Mehlich, Paul Martini, Vishwang Dave,
Arthur Miehlich, Christian Schabesberger, Vishaal Saravanan, Simon Heilig, Michelle Fribrance.
Here is the syllabus of the last academic year for reference; the current year should be similar. Note that the video sections carry a link: you can just click them to get to the first video and then go to the next video via the FAU.tv interface.
Recorded Syllabus Winter Semester 2021/22:
Administrativa
We will now go through the ground rules for the course. This is a kind of a social contract
between the instructor and the students. Both have to keep their side of the deal to make learning
as efficient and painless as possible.
Now we come to a topic that is always interesting to the students: the grading scheme.
Assessment, Grades
Academic Assessment: 90 minutes exam directly after courses end (∼ July 25
2023)
Retake Exam: 90 min exam directly after courses end the following semester (∼ . . . )
I basically do not have a choice in the grading scheme, as it is essentially the only one consistent with
university/state policies. For instance, I would like to give you more incentives for the homework
assignments – which would also mitigate the risk of having a bad day in the exam. Also, graded
quizzes would help you prepare for the lectures and thus let you get more out of them, but that
is also impossible.
Double Jeopardy : Homeworks only give 10% bonus points for the
exam, but without trying you are unlikely to pass the exam.
Admin: To keep things running smoothly
Homeworks will be posted on StudOn.
Sign up for AI-2 under https://www.studon.fau.de/crs4419186.html.
Homeworks are handed in electronically there. (plain text, program files, PDF)
Go to the tutorials, discuss with your TA! (they are there for you!)
Homework Discipline:
Start early! (many assignments need more than one evening’s work)
Don’t start by sitting at a blank screen (talking & study group help)
Humans will be trying to understand the text/code/math when grading it.
It is very well-established experience that without doing the homework assignments (or something
similar) on your own, you will not master the concepts, you will not even be able to ask sensible
questions, and take nothing home from the course. Just sitting in the course and nodding is not
enough! If you have questions please make sure you discuss them with the instructor, the teaching
assistants, or your fellow students. There are three sensible venues for such discussions: online in
the lecture, in the tutorials, which we discuss now, or in the course forum – see below. Finally, it
is always a very good idea to form study groups with your friends.
Approach: Weekly tutorials and homework assignments (first one in week two)
Goal 1: Reinforce what was taught in class. (you need practice)
Goal 2: Allow you to ask any question you have in a protected environment.
Instructor/Lead TA:
Florian Rabe (KWARC Postdoc)
Room: 11.137 @ Händler building, florian.rabe@fau.de
Tutorials: one each taught by Florian Rabe, . . .
One special case of academic rules that affects students is the question of cheating, which we will
cover next.
There is no need to cheat in this course!! (hard work will usually do)
Note: Cheating prevents you from learning (you are cutting into your own flesh)
We expect you to know what is useful collaboration and what is cheating.
You have to hand in your own original code/text/math for all assignments
You may discuss your homework assignments with others, but if doing so impairs
your ability to write truly original code/text/math, you will be cheating
Copying from peers, books or the Internet is plagiarism unless properly attributed
(even if you change most of the actual words)
I am aware that there may have been different standards about this at your previous
university! (these are the ground rules here)
There are data mining tools that monitor the originality of text/code.
Procedure: If we catch you at cheating. . . (correction: if we suspect cheating)
We will confront you with the allegation and impose a grade sanction.
If you have a reasonable explanation we lift that. (you have to convince us)
Note: Both active (copying from others) and passive cheating (allowing others to
copy) are penalized equally.
We are fully aware that the border between cheating and useful and legitimate collaboration is
difficult to find and will depend on the special case. Therefore it is very difficult to put this into
firm rules. We expect you to develop a firm intuition about behavior with integrity over the course of your stay at FAU. Do use the opportunity to discuss the AI-2 topics with others. After all, one
of the non-trivial skills you want to learn in the course is how to talk about Artificial Intelligence
topics. And that takes practice, practice, and practice.
Due to the current AI hype, the course Artificial Intelligence is very popular and thus many degree programs at FAU have adopted it for their curricula. Sometimes the course setup that fits the CS program does not fit the others very well; therefore there are some special conditions I want to state here.
In “Wirtschafts-Informatik” you can only take AI-1 and AI-2 together in the “Wahlpflicht-
bereich”.
ECTS credits need to be divisible by five ⇝ 7.5 + 7.5 = 15.
I can only warn of what I am aware, so if your degree program lets you jump through extra hoops,
please tell me and then I can mention them here.
Chapter 2
Format of the AI Course/Lecturing
We now come to the organization of the AI lectures this semester; this is really still part of the admin, but important enough to warrant its own chapter. First let me state the obvious, but there is an important point I want to make.
That being said – I know that it sounds quite idealistic – can I do something to help you along in this? Let me digress on lecturing styles; take the following with “cum kilo salis”, I want to make a point here, not bad-mouth my colleagues!
One person talks to 50+ students who just listen and take notes
The “I have a book that you do not have” style makes it hard to stay awake
We know how to keep large audiences engaged and motivated (even televised)
But the topic is different (AI-2 is arguably more complex than Sports/Media)
We’re not gonna be able to go all the way to TV entertainment (“AI-2 total”)
But I am going to (try to) incorporate some elements . . .
I will use interactive elements I call “questionnaires”. Here is one example to give you an idea
of what is coming.
Questionnaire
Question: How many scientific articles (6-page double-column “papers”) were
submitted to the 2020 International Joint Conference on Artificial Intelligence (IJ-
CAI’20; online in Montreal)?
a) 7? (6 accepted for publication)
b) 811? (205 accepted for publication)
c) 1996? (575 accepted for publication)
d) 4717? (592 accepted for publication)
Answer: (d) is correct. ((c) is for IJCAI’15 . . . )
One of the reasons why I like the questionnaire format is that it is a small instance of a question-
answer game that is much more effective in inducing learning – recall that learning happens in
the head of the student, no matter what the instructor tries to do – than frontal lectures. In
fact Sokrates – the grand old man of didactics – is said to have taught his students exclusively
by asking leading questions. His style coined the name of the teaching style “Socratic Dialogue”,
which unfortunately does not scale to a class of 100+ students.
Unfortunately, this idea of adding questionnaires is limited by a simple fact of life. Good questionnaires require good ideas, which are hard to come by; in particular for AI-2, I do not have many. But maybe you – the students – can help.
Resources
But what if you are not in a lecture or tutorial and want to find out more about the AI-2 topics?
Next we come to a special project that is going on in parallel to teaching the course. I am using the course materials as a research object as well. This gives you an additional resource, but may affect the shape of the course materials (which now serve a double purpose). Of course I can use all the help on the research project I can get, so please give me feedback, report errors and shortcomings, and suggest improvements.
FAU has issued a very insightful guide on using lecture recordings. It is a good idea to heed these
recommendations, even if they seem annoying at first.
Using lecture recordings:
Attend lectures.
Catch up.
We start the course by giving an overview of (the problems, methods, and issues of ) Artificial
Intelligence, and what has been achieved so far.
Naturally, this will dwell mostly on philosophical aspects – we will try to understand what the important issues might be, what questions we should even be asking, what the most important avenues of attack may be, and where AI research is being carried out.
In particular the discussion will be very non-technical – we have very little basis to discuss tech-
nicalities yet. But stay with me, this will drastically change very soon. A Video Nugget covering
the introduction of this chapter can be found at https://fau.tv/clip/id/21467.
Maybe we can get around the problems of defining “what Artificial intelligence is”, by just de-
scribing the necessary components of AI (and how they interact). Let’s have a try to see whether
that is more informative.
The components of Artificial Intelligence are quite daunting, and none of them are fully un-
derstood, much less achieved artificially. But for some tasks we can get by with much less. And
indeed that is what the field of Artificial Intelligence does in practice – but keeps the lofty ideal
around. This practice of “trying to achieve AI in selected and restricted domains” (cf. the discus-
sion starting with slide 27) has borne rich fruits: systems that meet or exceed human capabilities
in such areas. Such systems are in common use in many domains of application.
in outer space: in outer space, systems need autonomous control; remote control is impossible due to time lag.
in artificial limbs: the user controls the prosthesis via existing nerves, and can e.g. grip a sheet of paper.
in household appliances: the iRobot Roomba vacuums, mops, and sweeps in corners, . . . , parks, charges, and discharges; general robotic household help is on the horizon.
in hospitals: in the USA 90% of the prostate operations are carried out by RoboDoc; Paro is a cuddly robot that eases solitude in nursing homes.
The AI Conundrum
Observation: Reserving the term “Artificial Intelligence” has been quite a land
grab!
But: researchers at the Dartmouth Conference (1956) really thought they would solve/reach AI in two/three decades.
Consequence: AI still asks the big questions.
Another Consequence: AI as a field is an incubator for many innovative tech-
nologies.
Still Consequence: AI research was alternatingly flooded with money and cut off
brutally.
As a consequence, the field of Artificial Intelligence (AI) is an engineering field at the intersec-
tion of computer science (logic, programming, applied statistics), cognitive science (psychology,
neuroscience), philosophy (can machines think, what does that mean?), linguistics (natural lan-
guage understanding), and mechatronics (robot hardware, sensors).
Subsymbolic AI and in particular machine learning is currently hyped to such an extent, that
many people take it to be synonymous with “Artificial Intelligence”. It is one of the goals of this
course to show students that this is a very impoverished view.
We can classify the AI approaches by their coverage and the analysis depth (they
are complementary)
We combine the topics in this way in this course, not only because this reproduces the historical
development but also as the methods of statistical and subsymbolic AI share a common basis.
It is important to notice that all approaches to AI have their application domains and strong points. We will now see that exactly the two areas where symbolic AI and statistical/subsymbolic AI have their respective fortes correspond to natural application areas.
[Figure: precision/coverage diagram; producer tasks sit at the 100% precision end.]
General Rule: Subsymbolic AI is well suited for consumer tasks, while symbolic
AI is better suited for producer tasks.
An example of a producer task – indeed this is where the name comes from – is the case of a machine tool manufacturer T, which produces digitally programmed machine tools worth multiple million Euro and sells them into dozens of countries. Thus T must also produce comprehensive machine operation manuals, a non-trivial undertaking, since no two machines are identical and they must be translated into many languages, leading to hundreds of documents. As those manuals share a lot of semantic content, their management should be supported by AI techniques. It is critical that these methods maintain a high precision, since operation errors can easily lead to very costly machine damage and loss of production. On the other hand, the domain of these manuals is quite restricted: a machine tool has only a couple of hundred components that can be described by only a couple of thousand attributes.
Indeed companies like T employ high-precision AI techniques like the ones we will cover in this
course successfully; they are just not so much in the public eye as the consumer tasks.
One can usually defuse public worries about “is AI going to take control over the world” by just
explaining the difference between strong AI and weak AI clearly.
I would like to add a few words on AGI that – if you adopt them; they are not universally accepted – will strengthen the arguments differentiating between strong and weak AI.
I want to conclude this section with an overview over the recent protagonists – both personal and
institutional – of AGI.
Planning Frameworks
Planning Algorithms
Planning and Acting in the real world
Observation: The ability to represent knowledge about the world and to draw
logical inferences is one of the central components of intelligent behavior.
Thus: reasoning components of some form are at the heart of many AI systems.
Research in the KWARC group ranges over a variety of topics, which range from foundations
of mathematics to relatively applied web information systems. I will try to organize them into
three pillars here.
For all of these areas, we are looking for bright and motivated students to work with us. This
can take various forms, theses, internships, and paid student assistantships.
Sciences like physics or geology, and engineering, need high-powered equipment to perform measurements or experiments. Computer science, and in particular the KWARC group, needs high-powered human brains to build systems and conduct thought experiments.
The KWARC group may not always have as much funding as other AI research groups, but we are very dedicated to giving the best possible research guidance to the students we supervise. So if this appeals to you, please come by and talk to us.
Part I
This part of the course notes sets the stage for the technical parts of the course by establishing a common framework (Rational Agents) that gives context and ties together the various methods discussed in the course.
After having seen what AI can do and where AI is being employed today (see chapter 4), we will
now
Logic Programming
We will now learn a new programming paradigm: logic programming, which is one of the most
influential paradigms in AI. We are going to study ProLog (the oldest and most widely used) as a
concrete example of ideas behind logic programming and use it for our homeworks in this course.
As ProLog is a representative of a programming paradigm that is new to most students, programming will feel weird and tedious at first. But setting aside the unusual syntax and program organization, logic programming really only amounts to recursive programming, just as in functional programming (the other declarative programming paradigm). So the usual advice applies: keep staring at it and practice on easy examples until the pain goes away.
Logic programming is a programming paradigm that differs from functional and imperative
programming in the basic procedural intuition. Instead of transforming the state of the memory
by issuing instructions (as in imperative programming), or computing the value of a function on
some arguments, logic programming interprets the program as a body of knowledge about the
respective situation, which can be queried for consequences.
This is actually a very natural conception of a program; after all we usually run (imperative or functional) programs if we want some question answered. Video Nuggets covering this section can be found at https://fau.tv/clip/id/21752 and https://fau.tv/clip/id/21753.
Logic Programming
Idea: Use logic as a programming language!
We state what we know about a problem (the program) and then ask for results
(what the program would compute).
Example 5.1.1.
How to achieve this? Restrict a logic calculus sufficiently that it can be used as
computational procedure.
Remark: This idea leads to a totally new programming paradigm: logic programming.
We now formally define the language of ProLog, starting off the atomic building blocks.
The first three lines are ProLog facts and the last a rule.
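For instance, a program of this shape (three facts and one rule; a hypothetical stand-in, the original example may differ) could be:

human(sokrates).            % fact: sokrates is human
human(leibniz).             % fact: leibniz is human
greek(sokrates).            % fact: sokrates is greek
fallible(X) :- human(X).    % rule: if X is human, then X is fallible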
Definition 5.1.7. The knowledge base given by a ProLog program is the set of
facts that can be derived from it under the if/and reading above.
Definition 5.1.7 introduces a very important distinction: that between a ProLog program and the
knowledge base it induces. Whereas the former is a finite, syntactic object (essentially a string),
the latter may be an infinite set of facts, which represents the totality of knowledge about the
world or the aspects described by the program.
As knowledge bases can be infinite, we cannot precompute them. Instead, logic programming languages compute fragments of the knowledge base by need, i.e. whenever a user wants to check membership; we call this approach querying: the user enters a query term and the system answers yes or no. This answer is computed in a depth-first search process.
Definition 5.1.11. The ProLog interpreter keeps backchaining from the top to the bottom of the program until the query
succeeds, i.e. contains no more goals, or (answer: true)
fails, i.e. backchaining becomes impossible. (answer: false)
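For concreteness, the trace below presumably runs against the unary natural numbers program (a minimal sketch; nat/1 is assumed to be defined as follows):

nat(zero).                % zero is a natural number
nat(s(X)) :- nat(X).      % the successor of a natural number is a natural number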
?- nat(s(s(zero))).
?- nat(s(zero)).
?- nat(zero).
true
Note that backchaining replaces the current query with the body of the rule suitably instantiated.
For rules with a long body this extends the list of current goals, but for facts (rules without a
body), backchaining shortens the list of current goals. Once there are no goals left, the ProLog
interpreter finishes and signals success by issuing the string true.
If no rules match the current goal, then the interpreter terminates and signals failure with the
string false,
We can extend querying from simple yes/no answers to programs that return values by simply
using variables in queries. In this case, the ProLog interpreter returns a substitution.
In Example 5.1.15 the first backchaining step binds the variable X to the query variable Y, which gives us the two subgoals has_wheels(Y,4),has_motor(Y), which again contain the query variable Y.
The next backchaining step binds this to mybmw, and the third backchaining step exhausts the
subgoals. So the query succeeds with the (overall) answer substitution Y = mybmw. With this
setup, we can already do the “fallible Greeks” example from the introduction.
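For reference, a sketch of the program behind Example 5.1.15, as it can be reconstructed from the discussion above (the example reappears, extended, in Example 5.2.2 below):

has_wheels(mybmw,4).
has_motor(mybmw).
car(X) :- has_wheels(X,4), has_motor(X).

% querying with a variable returns an answer substitution:
% ?- car(Y).
% Y = mybmw.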
In this section, we want to really use ProLog as a programming language, so let us first get our tools set up.
Video Nuggets covering this section can be found at https://fau.tv/clip/id/21754 and
https://fau.tv/clip/id/21827.
?- unat(suc(suc(zero))).
Even though we can use any text editor to program ProLog, running ProLog in a modern editor with language support is incredibly nicer than at the command line, because you can see the whole history of what you have done. It’s better for debugging too. We will use emacs as an example in the following.
If you’ve never used emacs before, it still might be nicer, since it’s pretty easy to get used to the little bit of emacs that you need. (Just type “emacs &” at the UNIX command line to run it; if you are on a remote terminal without visual capabilities, you can use “emacs -nw”.)
If you don’t already have a file in your home directory called “.emacs” (note the dot at the front),
create one and put the following lines in it. Otherwise add the following to your existing .emacs
file:
(autoload 'run-prolog "prolog" "Start a Prolog sub-process." t)
(autoload 'prolog-mode "prolog" "Major mode for editing Prolog programs." t)
(setq prolog-program-name "swipl") ; or whatever the prolog executable name is
(add-to-list 'auto-mode-alist '("\\.pl$" . prolog-mode))
The file prolog.el, which provides prolog-mode, should already be installed on your machine; otherwise download it at http://turing.ubishops.ca/home/bruda/emacs-prolog/
Now, once you’re in emacs, you will need to figure out what your “meta” key is. Usually it’s the alt key. (Type the “control” key together with “h” to get help on using emacs.) So you’ll need a “meta-x” command: press the meta key, type “x”, and a little window will appear at the bottom of your emacs window with “M-x”, where you type run-prolog³.
This will start up the SWI ProLog interpreter, . . . et voilà!
The best thing is you can have two windows “within” your emacs window, one where you’re editing your program and one where you’re running ProLog. This makes debugging easier.
So far, all the examples led to direct success or to failure. (simpl. KB)
Definition 5.2.1 (ProLog Search Procedure). The ProLog interpreter employs top-down, left-to-right depth first search; concretely, it:
works on the subgoals in left-to-right order,
matches the first query with the head literals of the clauses in the program in top-down order,
if there are no matches, fails and backtracks to the (chronologically) last backtrack point,
otherwise backchains on the first match, keeping the other matches in mind for backtracking via backtrack points.
We can force backtracking to get more solutions by typing ;.
Note:
With the ProLog search procedure detailed above, computation can easily go into infinite loops,
even though the knowledge base could provide the correct answer. Consider for instance the simple
program
p(X) :- p(X).
p(X) :- q(X).
q(X).
3 Type “control” key together with “h” then press “m” to get an exhaustive mode help.
If we query this with ?- p(john), then DFS will go into an infinite loop because ProLog by default expands via the first clause. However, we could conclude that p(john) is true if we expanded via the second clause instead.
In fact this is a necessary feature and not a bug for a programming language: we need to
be able to write non-terminating programs, since the language would not be Turing complete
otherwise. The argument can be sketched as follows: we have seen that for Turing machines the
halting problem is undecidable. So if all ProLog programs were terminating, then ProLog would
be weaker than Turing machines and thus not Turing complete.
We will now fortify our intuition about the ProLog search procedure by an example that extends
the setup from Example 5.1.15 by a new choice of a vehicle that could be a car (if it had a motor).
Backtracking by Example
Example 5.2.2. We extend Example 5.1.15:
has_wheels(mytricycle,3).
has_wheels(myrollerblade,3).
has_wheels(mybmw,4).
has_motor(mybmw).
car(X):-has_wheels(X,3),has_motor(X). % cars sometimes have three wheels
car(X):-has_wheels(X,4),has_motor(X). % and sometimes four.
?- car(Y).
?- has_wheels(Y,3),has_motor(Y). % backtrack point 1
Y = mytricycle % backtrack point 2
?- has_motor(mytricycle).
FAIL % fails, backtrack to 2
Y = myrollerblade % backtrack point 2
?- has_motor(myrollerblade).
FAIL % fails, backtrack to 1
?- has_wheels(Y,4),has_motor(Y).
Y = mybmw
?- has_motor(mybmw).
Y=mybmw
true
In general, a ProLog rule of the form A:−B,C reads as A, if B and C. If we want to express A if
B or C, we have to express this two separate rules A:−B and A:−C and leave the choice which
one to use to the search procedure.
In Example 5.2.2 we indeed have two clauses for the predicate car/1; one each for the cases of cars
with three and four wheels. As the three-wheel case comes first in the program, it is explored first
in the search process.
Recall that at every point where the ProLog interpreter has the choice between two clauses for a predicate, it chooses the first and leaves a backtrack point. In Example 5.2.2 this happens first
for the predicate car/1, where we explore the case of three-wheeled cars. The ProLog interpreter
immediately has to choose again – between the tricycle and the rollerblade, which both have three
wheels. Again, it chooses the first and leaves a backtrack point. But as tricycles do not have mo-
tors, the subgoal has_motor(mytricycle) fails and the interpreter backtracks to the chronologically
nearest backtrack point (the second one) and tries to fulfill has_motor(myrollerblade). This fails
again, and the next backtrack point is point 1 – note the stack-like organization of backtrack points
which is in keeping with the depth-first search strategy – which chooses the case of four-wheeled cars. This ultimately succeeds as before with Y = mybmw.
We now turn to a more classical programming task: computing with numbers. Here we return to our initial example: adding unary natural numbers. If we can do that, then we can also consider multiplication and exponentiation.
expt(X,zero,s(zero)).
expt(X,s(Y),Z) :- expt(X,Y,W), mult(X,W,Z).
Note: Viewed through the right glasses, logic programming is very similar to functional programming; the only difference is that we are using (n+1)-ary relations rather than n-ary functions. To see how this works, let us consider the addition function/relation example above: instead of a binary function + we program a ternary relation add, where the relation add(X,Y,Z) means X + Y = Z. We start with the same defining equations for addition, rewriting them to relational style.
The first equation is straight-forward via our correspondence and we get the ProLog fact
add(X,zero,X). For the equation X + s(Y ) = s(X + Y ) we have to work harder, the straight-
forward relational translation add(X,s(Y),s(X+Y)) is impossible, since we have only partially
replaced the function + with the relation add. Here we take refuge in a very simple trick that we
can always do in logic (and mathematics of course): we introduce a new name Z for the offending
expression X + Y (using a variable) so that we get the fact add(X,s(Y ),s(Z)). Of course this is
not universally true (remember that this fact would say that “X + s(Y ) = s(Z) for all X, Y , and
Z”), so we have to extend it to a ProLog rule add(X,s(Y),s(Z)):−add(X,Y,Z). which relativizes to
mean “X + s(Y ) = s(Z) for all X, Y , and Z with X + Y = Z”.
Indeed the rule implements addition as a recursive predicate; we can see that the recursion relation is terminating, since the left hand sides have one more constructor for the successor function. The examples for multiplication and exponentiation can be developed analogously, but we have to use the naming trick twice.
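Putting the pieces together (a sketch following the recipe just described; the notes only spell out the addition case in prose), the addition and multiplication predicates could read:

add(X,zero,X).
add(X,s(Y),s(Z)) :- add(X,Y,Z).
% X * 0 = 0  and  X * s(Y) = (X * Y) + X; the intermediate result X * Y is named W
mult(_,zero,zero).
mult(X,s(Y),Z) :- mult(X,Y,W), add(W,X,Z).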
We now apply the same principle of recursive programming with predicates to other examples
to reinforce our intuitions about the principles.
?- add(s(zero),X,s(s(s(zero)))).
X = s(s(zero))
true
Example 5.2.5.
Computing the nth Fibonacci number (0, 1, 1, 2, 3, 5, 8, 13,. . . ; add the last two
to get the next), using the addition predicate above.
fib(zero,zero).
fib(s(zero),s(zero)).
fib(s(s(X)),Y) :- fib(s(X),Z), fib(X,W), add(Z,W,Y).
Note: The is relation does not allow “generate and test” inversion, as it insists on its right hand side being ground. In our example above, this is not a problem if we call fib with the first (“input”) argument a ground term. Indeed, if we match the last rule of an arithmetic variant of fib (sketched below) with a goal ?- fib(g,Y), where g is a ground term, then g-1 and g-2 are ground and thus D and E are bound to the (ground) result terms. This makes the input arguments in the two recursive calls ground, and we get ground results for Z and W, which allows the last goal to succeed with a ground result for Y. Note as well that re-ordering the body literals of the rule so that the recursive calls come before the computation literals will lead to failure.
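The arithmetic variant alluded to in the remark (a sketch using ProLog integers and the built-in is/2; the variable names D and E are taken from the discussion above) could look like this:

fib(0,0).
fib(1,1).
fib(X,Y) :- X > 1, D is X - 1, E is X - 2, fib(D,Z), fib(E,W), Y is Z + W.

Here D and E are computed (and thus ground) before the recursive calls; moving the recursive fib/2 goals in front of the is/2 goals would not work, since is/2 insists on a ground right hand side.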
We will now add the primitive data structure of lists to ProLog; they are constructed by prepending
an element (the head) to an existing list (which becomes the rest list or “tail” of the constructed
one).
append([],L,L).
append([X|R],L,[X|S]) :- append(R,L,S).
reverse([],[]).
reverse([X|R],L) :- reverse(R,S), append(S,[X],L).
Just as in functional programming languages, we can define list operations by recursion, only that
we program with relations instead of with functions.
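For instance (a usage sketch, not from the original notes), we can query these predicates as follows; note that append/3 can even be run “backwards” to split a list, another instance of the missing input/output distinction discussed below:

?- append([1,2],[3,4],L).
L = [1,2,3,4]
?- reverse([1,2,3],R).
R = [3,2,1]
?- append(X,Y,[1,2]).
X = [], Y = [1,2] ;
X = [1], Y = [2] ;
X = [1,2], Y = []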
Logic programming is the third large programming paradigm (together with functional program-
ming and imperative programming).
Example 5.2.9.
Generate and test:
sort(Xs,Ys) :- perm(Xs,Ys), ordered(Ys).
From a programming practice point of view it is probably best understood as “relational program-
ming” in analogy to functional programming, with which it shares a focus on recursion.
The major difference to functional programming is that “relational programming” does not have a fixed input/output distinction, which in functional programs is what makes the control flow very direct and predictable. Thanks to the underlying search procedure, we can sometimes make use of the flexibility afforded by logic programming.
If the problem solution involves search (and depth first search is sufficient), we can just get by with specifying the problem and letting the ProLog interpreter do the rest. In Example 5.2.9 we just specify that list Xs can be sorted into Ys, iff Ys is a permutation of Xs and Ys is ordered. Given a concrete (input) list Xs, the ProLog interpreter will generate all permutations Ys of Xs via the predicate perm/2 and then test whether they are ordered; a possible definition of these helper predicates is sketched below.
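A possible definition of the two helper predicates (a sketch; perm/2 and ordered/1 are not spelled out in the example, and select/3 is the standard library predicate that picks an element from a list):

perm([],[]).
perm(L,[X|R]) :- select(X,L,S), perm(S,R).
ordered([]).
ordered([_]).
ordered([X,Y|R]) :- X =< Y, ordered([Y|R]).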
This is a paradigmatic example of logic programming. We can (sometimes) directly use the specification of a problem as a program. This makes the argument for the correctness of the program immediate, but may make the program execution non-optimal.
It is easy to see that the running time of the ProLog program from Example 5.2.9 is not O(n log2(n)), which is optimal for sorting algorithms. This is the flip side of the flexibility in logic programming. But ProLog has ways of dealing with that: the cut operator, a ProLog atom which always succeeds, but which cannot be backtracked over. This can be used to prune the search tree in ProLog. We will not go into that here but refer the readers to the literature.
We “define” the computational behavior of the predicate rev, but the list constructors
[. . .] are just used to construct lists from arguments.
Example 5.2.14 (Trees and Leaf Counting). We represent (unlabelled) trees via the function t from tree lists to trees. For instance, a balanced binary tree of depth 2 is t([t([t([]),t([])]),t([t([]),t([])])]). We count leaves by
leafcount(t([]),1).
leafcount(t([X]),Y) :- leafcount(X,Y).
leafcount(t([X1,X2|R]),Y) :- leafcount(X1,Z), leafcount(t([X2|R]),W), Y is Z + W.
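For the balanced binary tree of depth 2 given above, this yields (a usage sketch):

?- leafcount(t([t([t([]),t([])]),t([t([]),t([])])]),N).
N = 4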
RTFM (≙ “read the fine manuals”)
RTFM Resources: There are also lots of good tutorials on the web;
I personally like [Fis; LPN],
[Fla94] has a very thorough logic-based introduction,
consult also the SWI Prolog Manual [SWI].
In this chapter we will briefly recap some of the prerequisites from theoretical computer science
that are needed for understanding Artificial Intelligence 1.
We now come to an important topic which is not really part of Artificial Intelligence but which adds an important layer of understanding to this enterprise: We (still) live in the era of Moore’s law (the computing power available on a single CPU doubles roughly every two years), leading to an exponential increase. A similar rule holds for main memory and disk storage capacities. And the production of computers (using CPUs and memory) is (still) growing very rapidly as well, giving mankind as a whole, institutions, and individuals exponentially growing computational resources.
In public discussion, this development is often cited as the reason why (strong) AI is inevitable. But the argument is fallacious if all the algorithms we have are of very high complexity (i.e. at least exponential in either time or space). So, to judge the state of play in Artificial Intelligence, we have to know the complexity of our algorithms.
In this section, we will give a very brief recap of some aspects of elementary complexity theory and make a case for why this is generally important for computer scientists.
A Video Nugget covering this section can be found at https://fau.tv/clip/id/21839 and
https://fau.tv/clip/id/21840.
In order to get a feeling for what we mean by “fast algorithm”, we do some preliminary computations.
performance (time needed as a function of problem size):

size n    | linear (100n µs) | quadratic (7n^2 µs) | exponential (2^n µs)
1         | 100 µs           | 7 µs                | 2 µs
5         | .5 ms            | 175 µs              | 32 µs
10        | 1 ms             | .7 ms               | 1 ms
45        | 4.5 ms           | 14 ms               | 1.1 Y
100       | 100 ms           | 7 s                 | 10^16 Y
1 000     | 1 s              | 12 min              | −
10 000    | 10 s             | 20 h                | −
1 000 000 | 1.6 min          | 2.5 mon             | −
So it does make a difference for larger problems what algorithm we choose. Considerations like
the one we have shown above are very important when judging an algorithm. These evaluations
go by the name of “complexity theory”.
Let us now recapitulate some notions of elementary complexity theory: we are interested in the worst case growth of the resources (time and space) required by an algorithm in terms of the sizes of its arguments. Mathematically we look at the functions from input size to resource size and classify them into “big-O” classes, abstracting from constant factors (which depend on the machine the algorithm runs on and which we cannot control) and initial (algorithm startup) factors.
Definition: Let α be an algorithm that terminates in time t(n) on inputs of size n; then we call t the worst case running time of α and write T(α) := t.
We say that α has time complexity in S (written T(α) ∈ S or colloquially T(α) = S), iff t ∈ S. We say α has space complexity in S, iff α uses only memory of size s(n) on inputs of size n and s ∈ S.
Landau set | class name  | rank        Landau set | class name  | rank
O(1)       | constant    | 1           O(n^2)     | quadratic   | 4
O(ln(n))   | logarithmic | 2           O(n^k)     | polynomial  | 5
O(n)       | linear      | 3           O(k^n)     | exponential | 6
For AI-2: I expect that, given an algorithm, you can determine its complexity class. (next)
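For instance (a worked example, not from the original slides): the reverse/2 predicate from the ProLog chapter calls append/3 once per element of its input, and append/3 is itself linear in the length of its first argument, so reversing a list of length n takes on the order of 1 + 2 + · · · + n = n(n + 1)/2 steps; hence reverse/2 has time complexity in O(n^2), i.e. it is quadratic.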
OK, that was the theory, . . . but how do we use that in practice?
The time complexity T (α) is just T∅ (α), where ∅ is the empty function.
Recursion is much more difficult to analyze ; recurrence relations and Master’s
theorem.
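For reference (the standard statement, not spelled out in the notes): for recurrences of the form T(n) = a·T(n/b) + f(n) with a ≥ 1 and b > 1, the Master theorem gives
T(n) ∈ Θ(n^(log_b a))           if f(n) ∈ O(n^(log_b a − ε)) for some ε > 0,
T(n) ∈ Θ(n^(log_b a) · log n)   if f(n) ∈ Θ(n^(log_b a)),
T(n) ∈ Θ(f(n))                  if f(n) ∈ Ω(n^(log_b a + ε)) for some ε > 0 and a·f(n/b) ≤ c·f(n) for some c < 1 and sufficiently large n.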
Please excuse the chemistry pictures, public imagery for CS is really just quite boring, this is what
people think of when they say “scientist”. So, imagine that instead of a chemist in a lab, it’s me
sitting in front of a computer.
But my 2nd attempt didn’t work either, which got me a bit agitated.
Ta-da . . . when, for once, I turned around and looked in the other direction–
CAN one actually solve this efficiently? – NP hardness was there to rescue me.
Example 6.1.4. Trying to find a sea route east to India (from Spain) (does not
exist)
Observation: Complexity theory saves you from spending lots of time trying to
invent algorithms that do not exist.
It’s like, you’re trying to find a route to India (from Spain), and you presume it’s somewhere to
the east, and then you hit a coast, but no; try again, but no; try again, but no; ... if you don’t
have a map, that’s the best you can do. But NP hardness gives you the map: you can check that
there actually is no way through here.
But what is this notion of NP completeness alluded to above? We observe that we can analyze the complexity of computational problems by the complexity of the algorithms that solve them. This gives us a notion of what to expect from solutions to a given problem class, and thus whether efficient (i.e. polynomial time) algorithms can exist at all.
Assume: In 3 years from now, you have finished your studies and are working in
your first industry job. Your boss Mr. X gives you a problem and says Solve It!. By
which he means, write a program that solves it efficiently.
Question: Assume further that, after trying in vain for 4 weeks, you got the next
meeting with Mr. X. How could knowing about NP hardness help?
Answer: reserved for the plenary sessions ; be there!
We have multiple notations for concatenation, since it is such a basic operation, which is used
so often that we will need very short notations for it, trusting that the reader can disambiguate
based on the context.
Now that we have defined the concept of a string as a sequence of characters, we can go on to give ourselves a way to distinguish between good strings (e.g. programs in a given programming language) and bad strings (e.g. those with syntax errors). The way to do this is by the concept of a formal language, which we are about to define.
Formal Languages
Definition 6.2.7. Let A be an alphabet, then we define the sets A+ := ⋃_{i∈N+} A^i of nonempty strings and A∗ := A+ ∪ {ϵ} of strings.
Example 6.2.8. If A = {a, b, c}, then A∗ = {ϵ, a, b, c, aa, ab, ac, ba, . . . , aaa, . . . }.
Definition 6.2.9. A set L ⊆ A∗ is called a formal language over A.
Definition 6.2.10.
We use c[n] for the string that consists of the character c repeated n times.
Example 6.2.11.
#[5] = ⟨#, #, #, #, #⟩
Example 6.2.12.
The set M := {ba[n] | n∈N} of strings that start with character b followed by an arbitrary number of a’s is a formal language over A = {a, b}.
Definition 6.2.13 (Operations on Languages). Let L, L1, and L2 be formal languages over the same alphabet, then we define the language level operations:
L1L2 := {s1s2 | s1∈L1 ∧ s2∈L2}, L+ := ⋃_{i∈N+} L^i, and L∗ := L+ ∪ {ϵ}.
There is a common misconception that a formal language is something that is difficult to understand as a concept. This is not true; the only thing a formal language does is separate the “good” from the “bad” strings. Thus we simply model a formal language as a set of strings: the “good” strings are members, and the “bad” ones are not.
Of course this definition only shifts complexity to the way we construct specific formal languages
(where it actually belongs), and we have learned two (simple) ways of constructing them: by
repetition of characters, and by concatenation of existing languages.
As mentioned above, the purpose of a formal language is to distinguish “good” from “bad”
strings. It is maximally general, but not helpful, since it does not support computation and
inference. In practice we will be interested in formal languages that have some structure, so that
we can represent formal languages in a finite manner (recall that a formal language is a subset of
A∗ , which may be infinite and even undecidable – even though the alphabet A is finite).
To remedy this, we will now introduce phrase structure grammars (or just grammars), the
standard tool for describing structured formal languages.
Definition 6.2.14.
A phrase structure grammar (or just grammar) is a tuple ⟨N , Σ, P , S ⟩ where
N is a finite set of nonterminal symbols,
Σ is a finite set of terminal symbols, members of Σ ∪ N are called symbols.
P is a finite set of grammar rules: pairs p := h→b, where h ∈ (Σ ∪ N)∗ N (Σ ∪ N)∗ and b ∈ (Σ ∪ N)∗. The string h is called the head of p and b the body.
S ∈ N is a distinguished symbol called the start symbol (also sentence symbol).
Intuition: Grammar rules map strings with at least one nonterminal to arbitrary
other strings.
Notation:
If we have n rules h→bi sharing a head, we often write h→b1 | . . . | bn instead.
We fortify our intuition about these – admittedly very abstract – constructions by an example
and introduce some more vocabulary.
S → NP ; Vi
NP → Article; N
Article → the | a | an
N → dog | teacher | . . .
Vi → sleeps | smells | . . .
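As an aside (a sketch, not part of the original notes): in ProLog such a context-free grammar can be written down almost verbatim as a definite clause grammar (DCG) and then used for parsing directly:

s --> np, vi.
np --> article, n.
article --> [the] ; [a] ; [an].
n --> [dog] ; [teacher].
vi --> [sleeps] ; [smells].

% ?- phrase(s, [the, teacher, sleeps]).   succeeds
% ?- phrase(s, [the, dog, teacher]).      fails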
Definition 6.2.16. The subset of lexical rules, i.e. those whose body consists of a
single terminal is called its lexicon and the set of body symbols the alphabet. The
nonterminals in their heads are called lexical categories.
Definition 6.2.17. The non-lexicon grammar rules are called structural, and the
nonterminals in the heads are called phrasal categories.
Now we look at just how a grammar helps in analyzing formal languages. The basic idea is
that a grammar accepts a word, iff the start symbol can be rewritten into it using only the rules
of the grammar.
Definition: A grammar G = ⟨N, Σ, P, S⟩ derives t ∈ (Σ ∪ N)∗ from s ∈ (Σ ∪ N)∗ in one step, iff there is a grammar rule p∈P with p = h→b and there are u, v ∈ (Σ ∪ N)∗, such that s = uhv and t = ubv. We write s →pG t (or s →G t if p is clear from the context) and use →∗G for the reflexive transitive closure of →G. We call s →∗G t a G-derivation of t from s.
Example (using the grammar above):
S →G NP; Vi
  →G Article; N; Vi
  →G Article; teacher; Vi
  →G the; teacher; Vi
  →G the; teacher; sleeps
Thus S →∗G the; teacher; sleeps, i.e. “The teacher sleeps” is a sentence.
Note that this process indeed defines a formal language given a grammar, but does not provide
an efficient algorithm for parsing, even for the simpler kinds of grammars we introduce below.
S → a; b; c | A
A → a; A; B ; c | a; b; c
c; B → B; c
b; B → b; b
alternative: s1 | . . . | sn ,
repetition: s∗ (arbitrary many s) and s+ (at least one s),
optional: [s] (zero or one times), and
grouping: (s1 ; . . . ; sn ), useful e.g. for repetition.
Example 6.2.26.
S ::= a; S; a
S ::= b; S; b
S ::= c
S ::= d
close to the official name of the syntactic category (for the use in the head)
In AI-2 we will only use context-free grammars (simpler, but problem still applies)
in AI-2: I will try to give “grammar overviews” that combine those, e.g. the
grammar of first-order logic.
variables    X ∈ V1
functions    f^k ∈ Σ^f_k
predicates   p^k ∈ Σ^p_k
terms        t ::= X                      variable
                |  f^0                    constant
                |  f^k(t_1, . . ., t_k)   application
formulae     A ::= p^k(t_1, . . ., t_k)   atomic
                |  ¬A                     negation
                |  A_1 ∧ A_2              conjunction
                |  ∀X A                   quantifier
We will generally get by with context-free grammars, which have highly efficient parsing algorithms, for the formal languages we use in this course, but we will not cover the algorithms in AI-2.
Mathematical Structures
Observation: Mathematicians often cast object classes as mathematical struc-
tures.
We have just seen this: repeated here for convenience.
Definition 6.3.1.
A phrase structure grammar (or just grammar) is a tuple ⟨N , Σ, P , S ⟩ where
Observation: Even though we call grammar rules “pairs” above, they are also
mathematical structures ⟨h, b⟩ with a funny notation h→b.
Most programming languages have some way of creating “named structures”. Ref-
erencing components is usually done via “dot notation”
Example 6.3.2 (Structs in C).
// Create strutures grule grammar
struct grule {
char[][] head;
char[][] body;
}
struct grammar {
char[][] nterminals;
char[][] termininals;
grule[] grules;
char[] start;
}
int main() {
struct grule r1;
r1.head = "foo";
r1.body = "bar";
}
I will try to always give “structure overviews”, that combine notations with “type”
information and accessor names, e.g.
In this chapter, we introduce a framework that gives a comprehensive conceptual model for the
multitude of methods and algorithms we cover in this course. The framework of rational agents
accommodates two traditions of AI.
Initially, the focus of AI research was on symbolic methods concentrating on the mental processes
of problem solving, starting from Newell/Simon’s “physical symbol hypothesis”:
A physical symbol system has the necessary and sufficient means for general intelligent action.
[NS76]
Here a symbol is a representation of an idea, object, or relationship that is physically manifested in (the brain of) an intelligent agent (human or artificial).
Later – in the 1980s – the proponents of embodied AI posited that most features of cognition,
whether human or otherwise, are shaped – or at least critically influenced – by aspects of the
entire body of the organism. The aspects of the body include the motor system, the perceptual
system, bodily interactions with the environment (situatedness) and the assumptions about the
world that are built into the structure of the organism. They argue that symbols are not always
necessary since
The world is its own best model. It is always exactly up to date. It always has every detail
there is to be known. The trick is to sense it appropriately and often enough. [Bro90]
The framework of rational agents – initially introduced by Russell and Wefald in [RW91] – accommodates both: it situates agents with percepts and actions in an environment, but does not preclude physical symbol systems – i.e. systems that manipulate symbols – as agent functions. Russell and Norvig make it the central metaphor of their book “Artificial Intelligence – A modern approach” [RN03], which we follow in this course.
Thinking humanly: “The exciting new effort to make computers think . . . machines with human-like minds” [Hau85]
Thinking rationally: “The formalization of mental faculties in terms of computational models” [CM85]
Acting humanly: “The art of creating machines that perform actions requiring intelligence when performed by people” [Kur90]
Acting rationally: “The branch of CS concerned with the automation of appropriate behavior in complex situations” [LS93]
We now discuss all of the four facets in a bit more detail, as they all either contribute directly
to our discussion of AI methods or characterize neighboring disciplines.
Note: In [Tur50], Alan Turing predicted that by 2000 a machine might have a 30% chance of fooling a lay person for 5 minutes.
Acting Rationally
Idea: Rational behavior ≙ doing the right thing!
Definition 7.1.4. Rational behavior consists of always doing what is expected to
maximize goal achievement given the available information.
Rational behavior does not necessarily involve thinking (e.g., the blinking reflex) – but thinking should be in the service of rational action.
Aristotle: Every art and every inquiry, and similarly every action and pursuit, is
thought to aim at some good. (Nicomachean Ethics)
Central Idea: This course is about designing agents that exhibit rational behavior, i.e. for any given class of environments and tasks, we seek the agent (or class of agents) with the best performance.
We assume that agents can always perceive their own actions. (but not necessarily
their consequences)
Problem: agent functions can become very big (theoretical tool only)
Definition 7.2.6. An agent function can be implemented by an agent program
that runs on a physical agent architecture.
Figure 2.1 (AIMA): Agents interact with environments through sensors and actuators. Different agents differ on the contents of the white box in the center.
Example: Vacuum-Cleaner World and Agent
percepts: location and contents, e.g., [A, Dirty]
actions: Left, Right, Suck, NoOp
A partial tabulation of the agent function:

Percept sequence                          Action
[A, Clean]                                Right
[A, Dirty]                                Suck
[B, Clean]                                Left
[B, Dirty]                                Suck
[A, Clean], [A, Clean]                    Right
[A, Clean], [A, Dirty]                    Suck
[A, Clean], [B, Clean]                    Left
[A, Clean], [B, Dirty]                    Suck
[A, Dirty], [A, Clean]                    Right
[A, Dirty], [A, Dirty]                    Suck
. . .                                     . . .
[A, Clean], [A, Clean], [A, Clean]        Right
[A, Clean], [A, Clean], [A, Dirty]        Suck
. . .                                     . . .

Science Question: What is the right agent function?
AI Question: Is there an agent architecture and an agent program that implements it?

[The slide reproduces the corresponding passage from AIMA:]
[. . . ] described by the agent function that maps any given percept sequence to an action.
We can imagine tabulating the agent function that describes any given agent; for most agents, this would be a very large table—infinite, in fact, unless we place a bound on the length of percept sequences we want to consider. Given an agent to experiment with, we can, in principle, construct this table by trying out all possible percept sequences and recording which actions the agent does in response.¹ The table is, of course, an external characterization of the agent. Internally, the agent function for an artificial agent will be implemented by an agent program. It is important to keep these two ideas distinct. The agent function is an abstract mathematical description; the agent program is a concrete implementation, running within some physical system.
To illustrate these ideas, we use a very simple example—the vacuum-cleaner world shown in Figure 2.2. This world is so simple that we can describe everything that happens; it’s also a made-up world, so we can invent many variations. This particular world has just two locations: squares A and B. The vacuum agent perceives which square it is in and whether there is dirt in the square. It can choose to move left, move right, suck up the dirt, or do nothing. One very simple agent function is the following: if the current square is dirty, then suck; otherwise, move to the other square. A partial tabulation of this agent function is shown in Figure 2.3 and an agent program that implements it appears in Figure 2.8 on page 48.
Looking at Figure 2.3, we see that various vacuum-world agents can be defined simply by filling in the right-hand column in various ways. The obvious question, then, is this: What is the right way to fill out the table? In other words, what makes an agent good or bad, intelligent or stupid? We answer these questions in the next section.
¹ If the agent uses some randomization to choose its actions, then we would have to try each sequence many times to identify the probability of each action. One might imagine that acting randomly is rather silly, but we show later in this chapter that it can be very intelligent.
Example 7.2.7 (Agent Program).
Table-Driven Agents
Idea: We can just implement the agent function as a table and look up actions.
The table is much too large: even with n binary percepts whose order of occurrence does not matter, we have 2^n rows in the table.
Who is supposed to write this table anyways, even if it “only” has a million
entries?
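To make the idea concrete (a hypothetical ProLog sketch, not from the original notes; in reality the table would be unmanageably large), a table-driven agent for the vacuum cleaner world could store the percept-sequence/action table as facts and act by pure lookup:

% one fact per row of the (truncated) agent function table
action([[a,clean]], right).
action([[a,dirty]], suck).
action([[b,clean]], left).
action([[b,dirty]], suck).
action([[a,clean],[a,clean]], right).
action([[a,clean],[a,dirty]], suck).
% ... one fact for every possible percept sequence ...

% the agent program is a mere table lookup on the full percept history
table_driven_agent(PerceptSequence, Action) :- action(PerceptSequence, Action).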
Rationality
Idea: Try to design agents that are successful! (aka. “do the right thing”)
Definition 7.3.1. A performance measure is a function that evaluates a sequence
of environments.
Example 7.3.2. A performance measure for the vacuum cleaner world could
Definition 7.3.4. An agent is called autonomous, if it does not rely on the prior knowledge of its designer about the environment.
Autonomy avoids fixed behaviors that can become unsuccessful in a changing environment. (anything else would be irrational)
The agent has to learn all relevant traits, invariants, and properties of the environment and its actions. (such an agent is called a learning agent)
Agents
Which are agents?
(A) James Bond.
(B) Your dog.
(C) Vacuum cleaner.
(D) Thermometer.
Answer: reserved for the plenary sessions ; be there!
Environment types
Observation 7.4.1. Agent design is largely determined by the type of environment
it is intended for.
Problem:
There is a vast number of possible kinds of environments in AI.
Observation 7.4.4.
The real world is (of course) a partially observable, stochastic, sequential, dynamic,
continuous, and multi agent environment. (worst case for AI)
In the AI-2 course we will work our way from the simpler environment types to the more general ones. Each environment type will need its own agent types specialized to surviving and doing well in them.
We will now discuss the main types of agents we will encounter in this course, get an impression
of the variety, and what they can and cannot do. We will start from simple reflex agents, add
state, and utility, and finally add learning. A Video Nugget covering this section can be found
at https://fau.tv/clip/id/21926.
Agent types
Observation: So far we have described (and analyzed) agents only by their behavior (cf. agent function f : P∗ → A).
Problem:
This does not help us to build agents. (the goal of AI)
To build an agent, we need to fix an agent architecture and come up with an agent
program that runs on it.
Preview: Four basic types of agent architectures in order of increasing generality:
1. simple reflex agents
2. model based agents
3. goal based agents
4. utility based agents
All these can be turned into learning agents.
Figure 2.10 (AIMA): A simple reflex agent. It acts according to a rule whose condition matches the current state, as defined by the percept.
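In contrast to the table-driven agent above, a simple reflex agent for the vacuum cleaner world needs only a handful of condition-action rules that look at the current percept (a hypothetical ProLog sketch, not from the original notes):

% condition-action rules over the current percept [Location, Status] only
rule([_, dirty], suck).
rule([a, clean], right).
rule([b, clean], left).

reflex_vacuum_agent(Percept, Action) :- rule(Percept, Action).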
Problem: Simple reflex agents can only react to the perceived state of the envi-
ronment, not to changes.
Example 7.5.3. Automobile tail lights signal braking by brightening. A simple reflex agent would have to compare subsequent percepts to realize this.
Problem: Partially observable environments get simple reflex agents into trouble.
Example 7.5.4. Vacuum cleaner robot with defective location sensor ; infinite
loops.
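As an illustration (not from the original notes), here is a minimal simple reflex agent for the two-square vacuum world; the location names and the rule set are assumptions for the sketch.

  # A simple reflex agent for the two-square vacuum world (illustrative sketch).
  # It maps the current percept directly to an action via condition-action rules;
  # it has no memory, so it cannot react to *changes* between percepts.

  def simple_reflex_vacuum_agent(percept):
      location, status = percept          # e.g. ("A", "Dirty")
      if status == "Dirty":
          return "Suck"
      elif location == "A":
          return "Right"
      elif location == "B":
          return "Left"

  # Example: the agent reacts only to the current percept.
  print(simple_reflex_vacuum_agent(("A", "Dirty")))   # -> Suck
  print(simple_reflex_vacuum_agent(("A", "Clean")))   # -> Right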
Model-based Reflex Agents
Agent Schema: (figure omitted: a model-based reflex agent that keeps an internal state, updated via “how the world evolves” and “what my actions do”.)
(optionally) a transition model T, that predicts a new state s′′ from a state s′ and an action a.
An action function f that maps (new) states to actions.
The agent function is iteratively computed via e ↦ f(S(s, e)).
Note: Since different percept sequences lead to different states, the agent function f_a : P* → A no longer depends only on the last percept.
Example 7.5.6 (Tail Lights Again). Model based agents can handle the tail light example from Example 7.5.3 if the states include a concept of tail light brightness.
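A minimal Python sketch (not from the notes; the state update and threshold are made up) of how such an agent program keeps and uses an internal state for the tail light example:

  # A model-based reflex agent for the tail light example (illustrative sketch).
  # The internal state remembers the previous percept, so the agent can react to
  # *changes* (brightening tail lights), which a simple reflex agent cannot.

  state = {"previous_brightness": None}   # internal world model

  def update_state(state, percept):
      """Hypothetical sensor model: record what we need from past percepts."""
      return {"previous_brightness": state.get("current_brightness"),
              "current_brightness": percept}

  def model_based_agent(percept):
      global state
      state = update_state(state, percept)
      prev, cur = state["previous_brightness"], state["current_brightness"]
      if prev is not None and cur > prev:   # tail lights brightened -> car ahead brakes
          return "Brake"
      return "KeepDriving"

  # Example: only the second percept (brighter lights) triggers braking.
  print(model_based_agent(0.3))   # -> KeepDriving
  print(model_based_agent(0.9))   # -> Brake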
Problem: Having a world model does not always determine what to do (ratio-
nally).
Example 7.5.8. Coming to an intersection, where the agent has to decide between
going left and right.
Goal-based agents
Problem:
A world model does not always determine what to do (rationally).
Observation: Having a goal in mind does! (determines future actions)
Agent Schema:
(Figure 2.13 from AIMA: A model-based, goal-based agent. It keeps track of the world state as well as a set of goals it is trying to achieve, and chooses an action that will (eventually) lead to the achievement of its goals.)
Goal-based agents (continued)
Definition 7.5.9. A goal based agent is a model based agent with transition model T that deliberates actions based on goals and a world model: It employs a set G of goals and a goal function f that, given a (new) state s′, selects an action a to best reach G.
The action function is then s ↦ f(T(s), G).
Observation: A goal based agent is more flexible in the knowledge it can utilize.
Example 7.5.10. A goal based agent can easily be changed to go to a new destination, while a model based agent’s rules make it go to exactly one destination.
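To make Definition 7.5.9 concrete, here is a minimal Python sketch (not part of the original notes); the transition model, goal sets, and place names are made-up illustrations.

  # A goal based agent sketch (illustrative; the route map and goals are made up).
  # The agent deliberates over a transition model T and a set of goals G and
  # prefers an action whose predicted successor state is a goal.

  transition_model = {          # T: which place an action leads to from a place
      ("Home", "go-A"): "A",
      ("Home", "go-B"): "B",
      ("A", "go-Dest"): "Destination",
      ("B", "go-Home"): "Home",
  }

  def goal_function(state, goals):
      """Pick an action whose predicted outcome is a goal, if one exists."""
      for (s, action), successor in transition_model.items():
          if s == state and successor in goals:
              return action
      # otherwise pick any applicable action (a real agent would search/plan here)
      for (s, action), successor in transition_model.items():
          if s == state:
              return action
      return "NoOp"

  # Changing the destination only means changing G, not rewriting the rules:
  print(goal_function("A", {"Destination"}))   # -> go-Dest
  print(goal_function("Home", {"B"}))          # -> go-B  (new goal, same rules)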
Utility-based agents
Definition 7.5.11. A utility based agent uses a world model along with a utility function that models its preferences among the states of that world. It chooses the action that leads to the best expected utility.
Agent Schema:
(Figure 2.14 from AIMA: A model-based, utility-based agent. It uses a model of the world, along with a utility function that measures its preferences among states of the world. Then it chooses the action that leads to the best expected utility, where expected utility is computed by averaging over all possible outcome states, weighted by the probability of the outcome.)
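A minimal Python sketch (not from the notes) of this expected-utility action selection; the stochastic transition model and utility values are made up.

  # Expected-utility action selection (illustrative sketch).
  # For each action we average the utilities of its possible outcome states,
  # weighted by their probabilities, and pick the action with the best expectation.

  # transition_model[action] = list of (probability, outcome_state)
  transition_model = {
      "left":  [(0.8, "safe"), (0.2, "ditch")],
      "right": [(0.5, "fast"), (0.5, "traffic_jam")],
  }
  utility = {"safe": 5.0, "ditch": -20.0, "fast": 10.0, "traffic_jam": -2.0}

  def expected_utility(action):
      return sum(p * utility[s] for p, s in transition_model[action])

  def choose_action():
      return max(transition_model, key=expected_utility)

  print({a: expected_utility(a) for a in transition_model})  # {'left': 0.0, 'right': 4.0}
  print(choose_action())                                     # -> right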
Learning Agents
Agent Schema:
(Figure omitted: a general learning agent, consisting of a performance element, a learning element that receives feedback from a critic (which judges behavior against a performance standard) and changes the performance element, and a problem generator that suggests exploratory actions.)
Solver specific to a particular problem (“domain”) vs. solver based on a description in a general problem-description language (e.g., the rules of any board game).
More efficient vs. much less design/maintenance work.
Next natural question: How do these work? (see the rest of the course)
Important Distinction: How the agent implements the world model.
Definition 7.6.1. We call a state representation atomic, iff it has no internal structure (states are just identifiers), factored, iff each state is given by a set of attribute value pairs, and structured, iff the state contains objects and their relationships.
Example 7.6.2. Consider the problem of finding a driving route from one end of
a country to the other via some sequence of cities.
In an atomic representation the state is represented by the name of a city.
In a factored representation we may have attributes “gps-location”, “gas”,. . .
(allows information sharing between states and uncertainty)
But how do we represent a situation where a large truck is blocking the road because it is trying to back into a driveway, but a loose cow is blocking its path? (an attribute
“TruckAheadBackingIntoDairyFarmDrivewayBlockedByLooseCow” is unlikely)
In a structured representation, we can have objects for trucks, cows, etc. and
their relationships.
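The following Python sketch (not from the notes; attribute names and objects are made up) contrasts the three kinds of representation for the driving example.

  # Atomic: the state is just a name, with no accessible internal structure.
  atomic_state = "Bucharest"

  # Factored: the state is a vector of attribute/value pairs; states can share
  # attributes, and some values may be unknown (uncertainty).
  factored_state = {"city": "Bucharest", "gps_location": (44.43, 26.10), "gas": 0.7}

  # Structured: the state contains objects and relations between them, so a
  # situation like "a truck backing into a driveway is blocked by a loose cow"
  # needs no monster attribute name.
  structured_state = {
      "objects":   ["truck1", "cow1", "driveway1", "road1"],
      "relations": [("blocks", "truck1", "road1"),
                    ("backing_into", "truck1", "driveway1"),
                    ("blocks", "cow1", "truck1")],
  }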
Summary
Agents interact with environments through actuators and sensors.
The agent function describes what the agent does in all circumstances.
The performance measure evaluates the environment sequence.
This part introduces search-based methods for general problem solving using atomic and fac-
tored representations of states.
Concretely, we discuss the basic techniques of search-based symbolic AI: first in the shape of classical, heuristic, and adversarial search, then in constraint propagation, where we see the first instances of inference-based methods.
Chapter 8
Problem Solving and Search
In this chapter, we will look at a class of algorithms called search algorithms. These are algorithms that help in quite general situations where there is a precisely described problem that needs to be solved. Hence the name “General Problem Solving” for the area.
We will use the following problem as a running example. It is simple enough to fit on one slide
and complex enough to show the relevant features of the problem solving algorithms we want to
talk about.
(Figure: the Romania road map from AIMA, with cities and route distances in km; this is our running example.)
it also limits the objectives by specifying goal states. (this excludes, e.g., staying another couple of weeks)
A solution is a sequence of actions that leads from the initial state to a goal state.
Problem solving computes solutions from problem formulations.
Finding the right level of abstraction and the required (not more!) information is
often the key to success.
Observation:
The formulation of problems from Definition 8.1.5 uses a (black-box) state representation. It
has enough functionality to construct the state space but nothing else. We will come back to this
in slide 117.
Remark 8.1.9. Note that search problems formalize problem formulations by making many of the
implicit constraints explicit.
search problem = ⟨ S (set of states), A (set of actions), T : A×S → P(S) (transition model), I (initial state), G (goal states) ⟩
We will now specialize Definition 8.1.5 to deterministic, fully observable environments, i.e. envi-
ronments where actions only have one – assured – outcome state.
Definition 8.1.11. The predicate that tests for goal states is called a goal test.
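As a minimal sketch (not from the notes), such a black-box search problem can be packaged as a small data structure; the vacuum-world state encoding used here is an assumption.

  # A deterministic search problem following the <S, A, T, I, G> formulation above
  # (illustrative sketch for the two-square vacuum world).

  from dataclasses import dataclass
  from typing import Callable, Iterable, Tuple

  State = Tuple[str, frozenset]          # (robot location, set of dirty rooms)

  @dataclass
  class SearchProblem:
      initial: State
      actions: Callable[[State], Iterable[str]]
      result: Callable[[State, str], State]     # deterministic transition model
      is_goal: Callable[[State], bool]          # the goal test
      step_cost: Callable[[State, str], float] = lambda s, a: 1.0

  def vacuum_actions(state):
      return ["left", "right", "suck"]

  def vacuum_result(state, action):
      loc, dirty = state
      if action == "suck":
          return (loc, dirty - {loc})
      return ("A" if action == "left" else "B", dirty)

  vacuum_problem = SearchProblem(
      initial=("A", frozenset({"A", "B"})),
      actions=vacuum_actions,
      result=vacuum_result,
      is_goal=lambda s: not s[1],               # goal test: no dirty rooms left
  )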
Observation 8.1.14. Declarative descriptions are strictly more powerful than black-
box descriptions: they induce blackbox descriptions, but also allow to analyze/sim-
plify the problem.
We will come back to this later ; planning.
Note that the definition of a search problem is very general; it applies to a great many real-world problems. So we will try to characterize these by difficulty. A Video Nugget covering this section can be found at https://fau.tv/clip/id/21928.
Problem types
Definition 8.2.1. A search problem is called a single state problem, iff it is
fully observable (at least the initial state)
deterministic (i.e. the successor of each state is determined)
static (states do not change other than by our own actions)
discrete (a countable number of states)
Definition 8.2.2. A search problem is called a multi state problem, iff it is like a single state problem except that the (initial) state is not fully observable, so that we have to reason about sets of possible states.
We will explain these problem types with another example. The problem P is very simple: We
have a vacuum cleaner and two rooms. The vacuum cleaner is in one room at a time. The floor
can be dirty or clean.
The possible states are determined by the position of the vacuum cleaner and the information whether each room is dirty or not. Obviously, there are eight states: S = {1, 2, 3, 4, 5, 6, 7, 8} for simplicity.
The goal is to have both rooms clean, and the vacuum cleaner can be anywhere. So the set G of goal states is {7, 8}. In the single-state version of the problem (start in state 5), [right, suck] is the shortest solution, but [suck, right, suck] is also one. In the multiple-state version we have, for example:
left → {3, 7}
suck → {7}
(Figure 3.3 from AIMA: The state space for the vacuum world. Links denote actions: L = Left, R = Right, S = Suck.)
Example: Vacuum-Cleaner World (continued)
Contingency Problem:
Murphy’s Law: suck can dirty a clean carpet.
Local sensing: dirt, location only.
Solution:
suck → {5, 7}
right → {6, 8}
(In general, a vacuum world with n locations has n · 2^n states.)
In the contingency version of P a solution is the following:
[suck → {5, 7}, right → {6, 8}, suck → {6, 8}]
etc. Of course, local sensing can help: narrow {6, 8} to {6} or {8}, if we are in the first, then
suck.
“Path cost”: There may be more than one solution and we might want to have the “best” one in
a certain sense.
“State”: e.g., we don’t care about tourist attractions found in the cities along the way. But this is
problem dependent. In a different problem it may well be appropriate to include such information
in the notion of state.
“Realizability”: one could also say that the abstraction must be sound wrt. reality.
Example: The 8-puzzle
(Figure: an 8-puzzle instance with a start state on the left and the goal state on the right.)
Example (Robotic assembly):
States: real-valued coordinates of robot joint angles and parts of the object to be assembled
Actions: continuous motions of robot joints
Goal test: assembly complete?
Path cost: time to execute
General Problems
Question: Which are “Problems”?
8.3 Search
A Video Nugget covering this section can be found at https://fau.tv/clip/id/21956.
(Figure 3.10 from AIMA: Nodes are the data structures from which the search tree is constructed. Each has a parent, a state, and various bookkeeping fields. Arrows point from child to parent.)
Observation: Paths in the search tree correspond to paths in the state space.
Definition 8.3.2. We define the path cost of a node n in a search tree T to be the sum of the step costs on the path from n to the root of T.
procedure Tree−Search (problem, strategy) /∗ returns a solution or failure ∗/
  fringe := insert(make_node(initial_state(problem)))
  loop
    if fringe <is empty> fail end if
    node := first(fringe,strategy)
    if NodeTest(State(node)) return State(node)
    else fringe := insert_all(expand(node,problem),strategy)
    end if
  end loop
end procedure
Definition 8.3.3. The fringe is a list of nodes that have not yet been considered in tree search. It is ordered by the strategy. (see below)
• Expand applies all operators of the problem to the current node and yields a set of new nodes.
• Insert inserts an element into the current fringe queue. This can change the behavior of the search.
• Insert-All performs Insert on a set of elements.
Search strategies
Definition 8.3.4. A strategy is a function that picks a node from the fringe of a
search tree. (equivalently, orders the fringe and picks the first.)
Definition 8.3.5 (Important Properties of Strategies).
Note that there can be infinite branches, see the search tree for Romania.
The opposite of uninformed search is informed or heuristic search, which uses a heuristic function that adds external guidance to the search process. In the Romania example, one could add the heuristic to prefer cities that lie in the general direction of the goal (here SE).
Even though heuristic search is usually much more efficient, uninformed search is important nonetheless, because many problems do not allow us to extract good heuristics.
Breadth-First Search
Idea: Expand the shallowest unexpanded node.
Definition 8.4.2. The breadth first search (BFS) strategy treats the fringe as a
FIFO queue, i.e. successors go in at the end of the fringe.
Example 8.4.3 (Synthetic).
(Figure: BFS on a synthetic binary tree with root A, children B and C, then D–G, and leaves H–O; frame by frame, the shallowest unexpanded node is expanded, level by level.)
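A minimal Python sketch (not from the notes) of BFS as tree search with a FIFO fringe; the toy successor function is made up.

  # Breadth first search as tree search with a FIFO fringe (illustrative sketch).

  from collections import deque

  def bfs(initial, successors, is_goal):
      """Return a list of states from initial to a goal, or None."""
      fringe = deque([[initial]])            # FIFO queue of paths
      while fringe:
          path = fringe.popleft()            # shallowest unexpanded node first
          node = path[-1]
          if is_goal(node):
              return path
          for succ in successors(node):      # expand: successors go in at the end
              fringe.append(path + [succ])
      return None

  # Tiny synthetic example: a binary tree A -> B,C -> D,E,F,G.
  tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
  print(bfs("A", lambda s: tree.get(s, []), lambda s: s == "G"))  # ['A', 'C', 'G']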
We will now apply the breadth first search strategy to our running example: Traveling in Romania.
Note that we leave out the green dashed nodes that allow us a preview of what the search tree will look like (if expanded). This gives a much cleaner picture; we assume that the readers have already grasped the mechanism sufficiently.
(Figure: successive BFS expansions of the Romania search tree rooted at Arad.)
Optimal?: No! If cost varies for different steps, there might be better solutions
below the level of the first one.
An alternative is to generate all solutions and then pick an optimal one. This works only if m is finite.
The next idea is to let cost drive the search. For this, we will need a non-trivial cost function: we
will take the distance between cities, since this is very natural. Alternatives would be the driving
time, train ticket cost, or the number of tourist attractions along the way.
Of course we need to update our problem formulation with the necessary information.
(Figure: the Romania road map from AIMA with route distances in km used as step costs.)
Uniform-cost search
Idea: Expand the least-cost unexpanded node.
Definition 8.4.4. Uniform-cost search (UCS) is the strategy where the fringe is ordered by increasing path cost.
Note: Equivalent to breadth first search if all step costs are equal.
Synthetic Example:
(Figure: successive UCS expansions of the Romania search tree from Arad; the fringe is ordered by path cost, e.g. Zerind 75, Timisoara 118, Sibiu 140 on the first level.)
Note that we must sum the distances to each leaf. That is, we go back to the first level after the
third step.
If step cost is negative, the same situation as in breadth first search can occur: later solutions
may be cheaper than the current one.
If step cost is 0, one can run into infinite branches. UCS then degenerates into depth first
search, the next kind of search algorithm we will encounter. Even if we have infinite branches,
where the sum of step costs converges, we can get into trouble, since the search is forced down
these infinite paths before a solution can be found.
Worst case is often worse than BFS, because large trees with small steps tend to be searched
first. If step costs are uniform, it degenerates to BFS.
Depth-first Search
Idea: Expand deepest unexpanded node.
Definition 8.4.5. Depth-first search (DFS) is the strategy where the fringe is organized as a (LIFO) stack, i.e. successors go in at the front of the fringe.
Note: Depth first search can perform infinite cyclic excursions
Need a finite, non cyclic state space (or repeated state checking)
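A minimal Python sketch (not from the notes) of DFS with a LIFO fringe and repeated state checking along the current path; the toy graph (a fragment of the Romania map) is only for illustration.

  # Depth first search as tree search with a LIFO fringe, plus simple repeated
  # state checking along the current path to avoid infinite cyclic excursions.

  def dfs(initial, successors, is_goal):
      fringe = [[initial]]                     # LIFO stack of paths
      while fringe:
          path = fringe.pop()                  # deepest unexpanded node first
          node = path[-1]
          if is_goal(node):
              return path
          for succ in successors(node):
              if succ not in path:             # cycle check on the current path
                  fringe.append(path + [succ])
      return None

  graph = {"Arad": ["Sibiu", "Timisoara", "Zerind"],
           "Sibiu": ["Arad", "Fagaras"], "Fagaras": ["Sibiu", "Bucharest"]}
  print(dfs("Arad", lambda s: graph.get(s, []), lambda s: s == "Bucharest"))
  # -> ['Arad', 'Sibiu', 'Fagaras', 'Bucharest']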
Depth-First Search
Example 8.4.6 (Synthetic).
(Figure: DFS on the synthetic binary tree with nodes A–O; frame by frame, the search runs down the leftmost branch first and backtracks on failure.)
(Figures: DFS expansions of the Romania search tree starting at Arad, and iterative deepening search shown as a sequence of depth-limited searches over the synthetic tree with nodes A, B, C, D–G.)
Completeness: Yes
Time complexity: (d + 1) · b^0 + d · b^1 + (d − 1) · b^2 + . . . + b^d ∈ O(b^(d+1))
Space complexity: O(b · d)
Optimality: Yes (if step cost = 1)
Consequence: IDS used in practice for search spaces of large, infinite, or unknown
depth.
Note:
To find a solution (at depth d) we have to search the whole tree up to d. Of course since we
do not save the search state, we have to re-compute the upper part of the tree for the next level.
This seems like a great waste of resources at first; however, IDS tries to be complete without the space penalties.
However, the space complexity is as good as DFS, since we are using DFS along the way. Like
in BFS, the whole tree on level d (of optimal solution) is explored, so optimality is inherited from
there. Like BFS, one can modify this to incorporate uniform cost search behavior.
As a consequence, variants of IDS are the method of choice if we do not have additional
information.
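A minimal Python sketch (not from the notes) of iterative deepening search; the toy tree is made up.

  # Iterative deepening search: repeated depth-limited DFS with growing limit.
  # The upper parts of the tree are re-expanded, but space stays O(b·d) as in DFS.

  def depth_limited(node, successors, is_goal, limit, path):
      if is_goal(node):
          return path
      if limit == 0:
          return None
      for succ in successors(node):
          result = depth_limited(succ, successors, is_goal, limit - 1, path + [succ])
          if result is not None:
              return result
      return None

  def ids(initial, successors, is_goal, max_depth=50):
      for limit in range(max_depth + 1):           # limit 0, 1, 2, ...
          result = depth_limited(initial, successors, is_goal, limit, [initial])
          if result is not None:
              return result
      return None

  tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
  print(ids("A", lambda s: tree.get(s, []), lambda s: s == "E"))  # ['A', 'B', 'E']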
Graph versions of all the tree search algorithms considered here exist, but are more
difficult to understand (and to prove properties about).
The (time complexity) properties are largely stable under duplicate pruning. (no
gain in the worst case)
All of these search algorithms explore the search tree induced by a search problem in search of a goal state. Search strategies only differ by the treatment of the fringe.
Search Strategies and their Properties: We have discussed breadth first, uniform cost, depth first, and iterative deepening search, and compared their completeness, time and space complexity, and optimality.
Best-first search
Idea: Order the fringe by estimated “desirability” (Expand most desirable
unexpanded node)
Definition 8.5.1. An evaluation function assigns a desirability value to each node
of the search tree.
Note: An evaluation function is not part of the search problem, but must be added externally.
Definition 8.5.2. In best first search, the fringe is a queue sorted in decreasing order of desirability.
Special cases: Greedy search, A∗ search
This is like UCS, but with an evaluation function related to the problem at hand replacing the path cost function.
If the heuristic is arbitrary, we expect incompleteness!
Depends on how we measure “desirability”.
Concrete examples follow.
Greedy search
Idea: Expand the node that appears to be closest to the goal.
In greedy search we replace the objective cost of constructing the current solution with a heuristic or subjective measure that we think gives a good idea of how far we are from a solution. Two
things have shifted:
• we went from internal (determined only by features inherent in the search space) to an exter-
nal/heuristic cost
• instead of measuring the cost to build the current partial solution, we estimate how far we are
from the desired goal
(Figure: the Romania road map from AIMA with route distances in km.)
(Figure: greedy best-first expansions for Arad → Bucharest using the straight-line distance to Bucharest as heuristic: Arad 366; then Sibiu 253, Timisoara 329, Zerind 374; from Sibiu: Arad, Fagaras, Oradea, Rimnicu Vilcea; from Fagaras: Sibiu 253, Bucharest 0.)
Heuristic Functions in Path Planning
Example 8.5.7 (The maze solved). We indicate h∗ by giving the goal distance of each cell.
(Figure: a 5×15 maze grid annotated with the goal distance h∗ of each cell; the goal G is in the bottom right corner.)
Example 8.5.8 (Maze Heuristic: the good case). We use the Manhattan distance to the goal as a heuristic.
(Figure: the same grid annotated with the Manhattan distance to G, ranging from 18 in the top left corner to 0 at G.)
Example 8.5.10. Greedy search can get stuck going from Iasi to Oradea: Iasi → Neamt → Iasi → Neamt → · · ·
(Figure: the Romania road map with route distances; from Iasi, Neamt is closer to Oradea in straight-line distance than Vaslui, but it is a dead end, so greedy search oscillates between Iasi and Neamt.)
Worst-case Time:
Same as depth first search.
Worst-case Space:
Same as breadth first search.
But: A good heuristic can give dramatic improvements.
Remark 8.5.11.
Greedy Search is similar to UCS. Unlike the latter, the node evaluation function has nothing to
do with the nodes explored so far. This can prevent nodes from being enumerated systematically
as they are in UCS and BFS.
For completeness, we need repeated state checking as the example shows. This enforces com-
plete enumeration of state space (provided that it is finite), and thus gives us completeness.
Note that nothing prevents all nodes from being searched in the worst case; e.g. if the heuristic function gives us the same (low) estimate on all nodes except where the heuristic mis-estimates the distance to be high. So in the worst case, greedy search is even worse than BFS, where d (depth of first solution) replaces m.
The search procedure cannot be optimal, since the actual cost of the solution is not considered.
For both, completeness and optimality, therefore, it is necessary to take the actual cost of
partial solutions, i.e. the path cost, into account. This way, paths that are known to be expensive
are avoided.
Heuristic Functions
Definition 8.5.12. Let Π be a problem with states S. A heuristic function (or short heuristic) for Π is a function h : S → R⁺₀ ∪ {∞} so that h(s) = 0 whenever s is a goal state.
h(s) is intended as an estimate of the distance between state s and the nearest goal state.
Definition 8.5.13. Let Π be a problem with states S, then the function h∗ : S → R⁺₀ ∪ {∞}, where h∗(s) is the cost of a cheapest path from s to a goal state, or ∞ if no such path exists, is called the goal distance function for Π.
Notes:
h(s) = 0 on goal states: If your estimator returns “I think it’s still a long way”
on a goal state, then its “intelligence” is, um . . .
Return value ∞: To indicate dead ends, from which the goal can’t be reached
anymore.
The distance estimate depends only on the state s, not on the node (i.e., the
path we took to reach s).
Note that the same word is often used for “rule of thumb” or “imprecise solution method”.
Proof: We show h(s)≤h∗(s) by induction over the length n of a cheapest path from s to a goal state.
1. base case
1.1. If s is a goal state, then h(s) = 0 by definition of heuristic, so h(s)≤h∗(s) as desired.
2. step case
2.1. We assume that h(s′)≤h∗(s′) for all states s′ with a cheapest goal path of length n.
2.2. Let s be a state whose cheapest goal path has length n+1 and the first transition is o = (s,s′).
2.3. By consistency, we have h(s) − h(s′)≤c(o) and thus h(s)≤h(s′) + c(o).
2.4. By construction, s′ has a cheapest goal path of length n and thus, by induction hypothesis, h(s′)≤h∗(s′).
2.5. By construction, h∗(s) = h∗(s′) + c(o).
2.6. Together this gives us h(s)≤h∗(s) as desired.
Thus f (n) is the estimated total cost of the path through n to a goal.
Definition 8.5.20. Best first search with evaluation function g + h is called A∗
search.
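A minimal Python sketch (not from the notes) of A∗ on a fragment of the Romania example; the listed road distances and the usual straight-line distances to Bucharest cover only a few cities.

  # A* search: best first search with evaluation function f(n) = g(n) + h(n).

  import heapq

  graph = {"Arad": {"Sibiu": 140, "Timisoara": 118, "Zerind": 75},
           "Sibiu": {"Fagaras": 99, "Rimnicu Vilcea": 80},
           "Fagaras": {"Bucharest": 211},
           "Rimnicu Vilcea": {"Pitesti": 97},
           "Pitesti": {"Bucharest": 101}}
  h = {"Arad": 366, "Sibiu": 253, "Timisoara": 329, "Zerind": 374, "Fagaras": 176,
       "Rimnicu Vilcea": 193, "Pitesti": 100, "Bucharest": 0}

  def a_star(start, goal):
      fringe = [(h[start], 0, [start])]            # entries are (f, g, path)
      while fringe:
          f, g, path = heapq.heappop(fringe)       # node with minimal f = g + h
          node = path[-1]
          if node == goal:
              return path, g
          for succ, cost in graph.get(node, {}).items():
              g2 = g + cost
              heapq.heappush(fringe, (g2 + h[succ], g2, path + [succ]))
      return None

  print(a_star("Arad", "Bucharest"))
  # (['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'], 418)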
This works, provided that h does not overestimate the true cost to achieve the goal. In other
words, h must be optimistic wrt. the real cost h∗ . If we are too pessimistic, then non-optimal
solutions have a chance.
A∗ Search: Optimality
Theorem 8.5.21. A∗ search with admissible heuristic is optimal.
Proof: We show that sub-optimal nodes are never selected by A∗
1. Suppose a suboptimal goal G has been generated; then we are in the following situation:
(Figure: the search tree rooted at the start node, with an unexpanded node n on an optimal path to the optimal goal O, and the suboptimal goal G already in the fringe.)
A∗ Search Example
(Figure: successive A∗ expansions for the Romania example with f = g + h annotations, starting with Arad 366=0+366 and later values such as 646=280+366 and 671=291+380.)
Additional Observations (Not Limited to Path Planning)
Example 8.5.22 (Greedy best-first search, “good case”).
(Figure: the maze from Example 8.5.8, annotated once with the Manhattan distance heuristic h and once with the g + h values encountered by A∗.)
In A∗ with a consistent heuristic, g + h always increases monotonically (h cannot decrease more than g increases).
We need more search, in the “right upper half”. This is typical: Greedy best-first search tends to be faster than A∗.
Additional Observations (Not Limited to Path Planning)
Example 8.5.24 (Greedy best-first search, “bad case”). We use the Manhattan distance to the goal as a heuristic again.
(Figure: a maze with a long dead-end street, annotated once with the Manhattan distance heuristic h and once with the values encountered by the search.)
We will search less of the “dead-end street”. Sometimes g + h gives better search guidance than h. (; A∗ is faster there)
Additional Observations (Not Limited to Path Planning)
(Figure: the same maze annotated with the node values encountered by A∗.)
In A∗, node values always increase monotonically (with any heuristic). If the heuristic is perfect, they remain constant on optimal paths.
A∗ search: f-contours
A∗ gradually adds “f-contours” of nodes to the explored region.
(Figure: the Romania map with concentric f-contours at 380, 400, and 420 around Arad.)
With more accurate heuristics, the bands stretch toward the goal and become more narrowly focused around the optimal path. If C∗ is the cost of the optimal solution path, then
A∗ expands all nodes with f(n) < C∗,
and it might then expand some of the nodes right on the “goal contour” (where f(n) = C∗) before selecting a goal node, but it expands no nodes with f(n) > C∗.
Completeness requires that there be only finitely many nodes with cost less than or equal to C∗, a condition that is true if all step costs exceed some finite ε and if b is finite.
8.5.4 Finding Good Heuristics
A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/22021.
Since the availability of admissible heuristics is so important for informed search (particularly A∗), let us see how such heuristics can be obtained in practice. We will look at an example, and then derive a general procedure from that.
Admissible heuristics: Example 8-puzzle
(Figure: an 8-puzzle instance with a start state on the left and the goal state on the right.)
Example 8.5.28. Let h2(n) be the total Manhattan distance of each tile from its desired location. (h2(S) = 2 + 0 + 3 + 1 + 0 + 1 + 3 + 4 = 14)
Observation 8.5.29 (Typical search costs). (IDS ≙ iterative deepening search)
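As an illustration (not part of the notes), here is a small Python sketch of the two standard 8-puzzle heuristics, h1 (number of misplaced tiles) and h2 (total Manhattan distance); the board encoding, goal-state convention, and example state are assumptions.

  # h1 and h2 for 3x3 boards given as tuples of rows, with 0 for the blank.

  GOAL = ((1, 2, 3), (4, 5, 6), (7, 8, 0))   # assumed goal convention

  def positions(board):
      return {board[r][c]: (r, c) for r in range(3) for c in range(3)}

  def h1(board):
      """Number of tiles not in their goal position (the blank does not count)."""
      goal_pos = positions(GOAL)
      return sum(1 for tile, pos in positions(board).items()
                 if tile != 0 and pos != goal_pos[tile])

  def h2(board):
      """Sum of Manhattan distances of each tile from its goal position."""
      goal_pos = positions(GOAL)
      return sum(abs(r - goal_pos[t][0]) + abs(c - goal_pos[t][1])
                 for t, (r, c) in positions(board).items() if t != 0)

  state = ((1, 2, 3), (4, 0, 6), (8, 7, 5))
  print(h1(state), h2(state))   # h2 dominates h1: h2(n) >= h1(n) for all n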
Dominance
Definition 8.5.30. Let h1 and h2 be two admissible heuristics. We say that h2 dominates h1, iff h2(n) ≥ h1(n) for all n.
Theorem 8.5.31. If h2 dominates h1 , then h2 is better for search than h1 .
Relaxed problems
Observation: Finding good admissible heuristics is an art!
Idea: Admissible heuristics can be derived from the exact solution cost of a relaxed
version of the problem.
Example 8.5.32. If the rules of the 8-puzzle are relaxed so that a tile can move
anywhere, then we get heuristic h1 .
Example 8.5.33. If the rules are relaxed so that a tile can move to any adjacent
square, then we get heuristic h2 . (Manhattan distance)
Definition 8.5.34. Let Π := ⟨S, A, T, I, G⟩ be a search problem, then we call a search problem Pr := ⟨S, A^r, T^r, I^r, G^r⟩ a relaxed problem (wrt. Π; or simply relaxation of Π), iff A ⊆ A^r, T ⊆ T^r, I ⊆ I^r, and G ⊆ G^r.
Lemma 8.5.35. If Pr relaxes Π, then every solution for Π is one for Pr.
Key point: The optimal solution cost of a relaxed problem is not greater than the
optimal solution cost of the real problem.
Relaxation means to remove some of the constraints or requirements of the original problem,
so that a solution becomes easy to find. Then the cost of this easy solution can be used as an
optimistic approximation of the problem.
See http://qiao.github.io/PathFinding.js/visual/
Difference to Breadth-first Search?: That would explore all grid cells in a circle
around the initial state!
Example 8.6.2.
All tree search algorithms (except pure depth first search) are systematic. (given
reasonable assumptions e.g. about costs.)
Observation 8.6.3. Systematic search algorithms are complete.
Observation 8.6.4. In systematic search algorithms there is no limit on the number of nodes that are kept in memory at any time.
Alternative: Keep only one (or a few) nodes at a time ;
no systematic exploration of all options, ; incomplete.
Local Search: Iterative improvement algorithms
Definition 8.6.7 (Traveling Salesman Problem). Find the shortest trip through a set of cities such that each city is visited exactly once.
Idea: Start with any complete tour, perform pairwise exchanges.
procedure Hill−Climbing (problem) /∗ returns a state that is a local maximum ∗/
  current := Make−Node(Initial−State[problem])
  loop
    neighbor := <a highest−valued successor of current>
    if Value[neighbor] < Value[current] return [current] end if
    current := neighbor
  end loop
end procedure
Intuition:
Like best first search without memory.
In order to understand the procedure on a more intuitive level, let us consider the following
scenario: We are in a dark landscape (or we are blind), and we want to find the highest hill. The
search procedure above tells us to start our search anywhere, and for every step first feel around,
and then take a step into the direction with the steepest ascent. If we reach a place, where the
next step would take us down, we are finished.
Of course, this will only get us into local maxima, and has no guarantee of getting us into
global ones (remember, we are blind). The solution to this problem is to re-start the search at
random (we do not have any information) places, and hope that one of the random jumps will get
us to a slope that leads to a global maximum.
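The following Python sketch (not from the notes) implements this intuition: steepest-ascent hill climbing plus random restarts; the 1D objective landscape is made up.

  # Hill climbing (steepest ascent) with random restarts (illustrative sketch).

  import random

  def hill_climb(start, neighbors, value):
      current = start
      while True:
          best = max(neighbors(current), key=value, default=None)
          if best is None or value(best) <= value(current):
              return current                     # local maximum reached
          current = best

  def random_restart_hill_climb(candidates, neighbors, value, restarts=10):
      starts = [random.choice(candidates) for _ in range(restarts)]
      return max((hill_climb(s, neighbors, value) for s in starts), key=value)

  # A bumpy 1D landscape over the integers 0..99 with many local maxima.
  value = lambda x: -(x - 42) ** 2 + 15 * (x % 7)
  neighbors = lambda x: [n for n in (x - 1, x + 1) if 0 <= n <= 99]
  print(random_restart_hill_climb(range(100), neighbors, value))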
Example 8.6.10. An 8-queens state with heuristic cost estimate h = 17. (the board figure is omitted)
Recent work on hill climbing algorithms tries to combine complete search with randomization to escape certain odd phenomena occurring in the statistical distribution of solutions.
Simulated annealing (Idea)
Annealing is the process of heating steel and letting it cool gradually to give it time to grow an optimal crystal structure.
Simulated annealing is like shaking a ping pong ball occasionally on a bumpy surface to free it. (so it does not get stuck)
Devised by Metropolis et al. for physical process modelling [Met+53]
Widely used in VLSI layout, airline scheduling, etc.
Simulated annealing (Implementation)
Definition 8.6.13. The following algorithm is called simulated annealing:
procedure Simulated−Annealing (problem,schedule) /∗ a solution state ∗/
  local node, next /∗ nodes ∗/
  local T /∗ a ‘‘temperature’’ controlling the probability of downward steps ∗/
  current := Make−Node(Initial−State[problem])
  for t := 1 to ∞
    T := schedule[t]
    if T = 0 return current end if
    next := <a randomly selected successor of current>
    ∆(E) := Value[next] − Value[current]
    if ∆(E) > 0 current := next
    else current := next /∗ only with probability e^(∆(E)/T) ∗/
    end if
  end for
end procedure
e^(E(x∗)/kT) / e^(E(x)/kT) = e^((E(x∗)−E(x))/kT) ≫ 1 for small T.
Question: Is this necessarily an interesting guarantee?
(Figure 4.6 from AIMA: The genetic algorithm, illustrated for digit strings representing 8-queens states. The initial population in (a) is ranked by the fitness function in (b), resulting in pairs for mating in (c). They produce offspring in (d), which are subject to mutation in (e).)
Genetic algorithms (continued)
(Figure 4.7 from AIMA: The 8-queens states corresponding to the first two parents in Figure 4.6(c) and the first offspring in Figure 4.6(d). The shaded columns are lost in the crossover step and the unshaded columns are retained.)
Note: Genetic algorithms ≠ evolution: e.g., real genes also encode the replication machinery!
9.1 Introduction
Video Nuggets covering this section can be found at https://fau.tv/clip/id/22060 and
https://fau.tv/clip/id/22061.
An Example Game
Answer: Declarative! With “game description language” ≙ natural language.
Given limited time, the best we can do is look ahead as far as we can. Evaluation
functions tell us how to evaluate the leaf states at the cut off.
Alpha-Beta Search: How to prune unnecessary parts of the tree?
Often, we can detect early on that a particular action choice cannot be part of
the optimal strategy. We can then stop considering this part of the game tree.
State of the art: What is the state of affairs, for prominent games, of computer
game playing vs. human experts?
Just FYI (not part of the technical content of this course).
“Minimax”?
We want to compute an optimal strategy for player “Max”.
In other words: We are Max, and our opponent is Min.
Recall:
We compute the strategy offline, before the game begins. During the game,
whenever it’s our turn, we just lookup the corresponding action.
Max attempts to maximize the utility u(s) of the terminal state that will be
reached during play.
Min attempts to minimize u(s).
So what?
The computation alternates between minimization and maximization ; hence “minimax”.
Example Tic-Tac-Toe
(Figure 5.1 from AIMA: A (partial) game tree for the game of tic-tac-toe. The top node is the initial state, and MAX moves first, placing an X in an empty square. We show part of the tree, giving alternating moves by MIN (O) and MAX (X), until we eventually reach terminal states, which can be assigned utilities according to the rules of the game.)
Game tree, current player marked on the left.
Last row: terminal positions with their utility.
Minimax: Outline
In a normal search problem, the optimal solution would be a sequence of actions leading to a goal state—a terminal state that is a win. In adversarial search, MIN has something to say about it. MAX therefore must find a contingent strategy, which specifies MAX’s move in the initial state, then MAX’s moves in the states resulting from every possible response by MIN, and so on.
Minimax: Example
(Figure: a two-ply game tree with leaf utilities 3 12 8 | 2 4 6 | 14 5 2; the three Min nodes have values 3, 2, and 2, so the Max root has value 3.)
Note: The maximal possible pay-off is higher for the rightmost branch, but assuming
perfect play of Min, it’s better to go left. (Going right would be “relying on your
opponent to do something stupid”.)
There’s no need to re-run minimax for every game state: Run it once, offline
before the game starts. During the actual game, just follow the branches taken
in the tree. Whenever it’s your turn, choose an action maximizing the value of
the successor states.
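A minimal Python sketch (not from the notes) of minimax over an explicit game tree; the tree encoding is an assumption, using the two-ply example above.

  # Minimax over an explicit game tree (leaf utilities 3 12 8 | 2 4 6 | 14 5 2).

  def minimax(node, maximizing):
      if isinstance(node, (int, float)):        # terminal state: return its utility
          return node
      values = [minimax(child, not maximizing) for child in node]
      return max(values) if maximizing else min(values)

  tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]    # Max node over three Min nodes
  print(minimax(tree, True))                    # -> 3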
Analogy to heuristic functions (cf. section 8.5): We want f to be both (a) accurate
and (b) fast.
Another analogy: (a) and (b) are in contradiction ; need to trade-off accuracy
against overhead.
Example Chess
This assumes that the features (their contribution towards the actual value of the state) are
independent. That’s usually not the case (e.g. the value of a Rook depends on the Pawn struc-
ture).
(Figure: minimax values on the example tree; after the first Min subtree evaluates to 3, the Max root is known to be ≥ 3.)
Alpha Pruning
What is α? For each search node n, the highest Max-node utility that search has
encountered on its path from the root to n.
(Figure: alpha values during the search: the Max root starts with α = −∞, is updated to α = 3 after the first Min node, and the second Min node can be abandoned as soon as its first leaf 2 ≤ α is seen.)
How to use α?: In a Min node n, if one of the successors already has utility ≤ α,
then stop considering n. (Pruning out its remaining successors.)
Alpha-Beta Pruning
Recall:
What is α: For each search node n, the highest Max-node utility that search
has encountered on its path from the root to n.
How to use α: In a Min node n, if one of the successors already has utility
≤ α, then stop considering n. (Pruning out its remaining successors.)
Idea: We can use a dual method for Min:
What is β: For each search node n, the lowest Min-node utility that search has
encountered on its path from the root to n.
How to use β: In a Max node n, if one of the successors already has utility
≥ β, then stop considering n. (Pruning out its remaining successors.)
. . . and of course we can use both together!
Note: α only gets assigned a value in Max nodes, and β only gets assigned a value in Min nodes.
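A minimal Python sketch (not from the notes) of alpha-beta pruning on the same explicit game tree; the tree encoding is an assumption.

  # Alpha-beta pruning: α is the best Max value on the path so far, β the best
  # Min value; a Min node stops as soon as one successor value is <= α (dually for Max).

  def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
      if isinstance(node, (int, float)):
          return node
      if maximizing:
          value = float("-inf")
          for child in node:
              value = max(value, alphabeta(child, False, alpha, beta))
              alpha = max(alpha, value)
              if value >= beta:                 # prune remaining successors
                  break
          return value
      else:
          value = float("inf")
          for child in node:
              value = min(value, alphabeta(child, True, alpha, beta))
              beta = min(beta, value)
              if value <= alpha:                # prune remaining successors
                  break
          return value

  tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
  print(alphabeta(tree, True))                  # -> 3 (the leaves 4 and 6 are pruned)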
(Figure: alpha-beta search on the example tree with leaves 3 12 8 | 2 | 14 5 2; the [α, β] windows are updated during the search, e.g. the root goes from [−∞, ∞] to [3, ∞], the first Min node ends at value 3, the second is cut off after its first leaf 2, and the third ends at value 2 after seeing 14, 5, 2.)
Note: We could have saved work by choosing the opposite order for the successors
of the rightmost Min node. Choosing the best moves (for each of Max and Min)
first yields more pruning!
(Figure: the same example with the successors of the rightmost Min node explored in the opposite order, which yields more pruning.)
And now . . .
Definition 9.5.1. For Monte Carlo sampling we evaluate actions through sampling.
When deciding which action to take on game state s:
while time not up do
select action a applicable to s
run a random sample from a until terminal state t
return an a for s with maximal average u(t)
Definition 9.5.2. For the Monte Carlo tree search algorithm (MCTS) we maintain
a search tree T , the MCTS tree.
while time not up do
apply actions within T to select a leaf state s′
select action a′ applicable to s′ , run random sample from a′
add s′ to T , update averages etc.
return an a for s with maximal average u(t)
When executing a, keep the part of T below a.
This looks only at a fraction of the search tree, so it is crucial to have good guidance where to go,
i.e. which part of the search tree to look at.
(Figure: six Monte Carlo samples from the root with three applicable actions; the per-action expansion counts grow from 0, 0, 0 to 2, 2, 2 and the average rewards from 0, 0, 0 to 60, 55, 35; the sampled terminal rewards are 10, 70, 40, 30, 50, and 100.)
The sampling goes middle, left, right, right, left, middle. Then it stops and selects the highest-average action: left, with average 60. After the first sample, when the values in the initial state are being updated, we have the following “expansions” and “avg. reward” fields: a small number of expansions is favored for exploration (visit parts of the tree rarely visited before: what is out there?); a high avg. reward is favored for exploitation (focus on promising parts of the search tree).
(Figure: the same six samples, now incrementally building the MCTS tree; each node added to the tree keeps its own expansion counts and average rewards, which are updated along the entire path of every sample.)
This is the exact same search as on the previous slide, but now we incrementally build the search tree by always keeping the first state of the sample. The first three iterations (middle, left, right) show the tree extension; note that, like the root node, the nodes added to the tree have expansion and avg. reward counters for every applicable action. In the next iteration (right), after the leaf node with value 30 was found, the important point is that the averages get updated along the entire path, i.e., not only in the root as we did before, but also in the nodes along the way. After all six iterations have been done, as before we select the action left (value 60); but now we keep the part of the tree below that action, saving relevant work already done before.
AlphaGo: Overview
Definition 9.5.5 (Neural Networks in AlphaGo).
Policy networks: Given a state s, output a probability distribution over the
actions applicable in s.
Value networks: Given a state s, output a number estimating the game value
of s.
[Figure 1 from [Sil+16]: neural network training pipeline and architecture — a, the training pipeline from human expert positions (rollout policy pπ, SL policy network pσ) via self-play positions (RL policy network pρ) to the value network vθ; b, the policy network and value network architectures.]
Illustration taken from [Sil+16]
Rollout policy pπ: Simple but fast, ≈ prior work on Go.
SL policy network pσ: Supervised learning, human-expert data (“learn to choose an expert action”).
RL policy network pρ: Reinforcement learning, self-play (“learn to win”).
Value network vθ: Use self-play games with pρ as training data for game-position evaluation vθ (“predict which player will win in this state”).
Comments on the Figure:
a A fast rollout policy pπ and a supervised learning (SL) policy network pσ are trained to predict human expert moves in a data set of positions. A reinforcement learning (RL) policy network pρ is initialized to the SL policy network, and is then improved by policy gradient learning to maximize the outcome (that is, winning more games) against previous versions of the policy network. A new data set is generated by playing games of self-play with the RL policy network. Finally, a value network vθ is trained by regression to predict the expected outcome (that is, whether the current player wins) in positions from the self-play data set.
b Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position s as its input, passes it through many convolutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a probability distribution pσ(a|s) or pρ(a|s) over legal moves a, represented by a probability map over the board. The value network similarly uses many convolutional layers with parameters θ, but outputs a scalar value vθ(s′) that predicts the expected outcome in position s′.
[Figure 3 from [Sil+16]: Monte Carlo tree search in AlphaGo (panels a–d).]
Illustration taken from [Sil+16]
Rollout policy pπ: Action choice in random samples.
SL policy network pσ: Action bias within the UCTS tree (stored as “P”, which shrinks to “u(P)” with the number of visits); used when descending the tree along with the quality Q.
RL policy network pρ: Not used here (used only to learn vθ).
Value network vθ: Used to evaluate leaf states s, in a linear sum with the value returned by a random sample on s.
Comments on the Figure:
a Each simulation traverses the tree by selecting the edge with maximum action value Q, plus a bonus u(P) that depends on a stored prior probability P for that edge.
b The leaf node may be expanded; the new node is processed once by the policy network pσ and the output probabilities are stored as prior probabilities P for each action.
c At the end of a simulation, the leaf node is evaluated in two ways:
• using the value network vθ,
• and by running a rollout to the end of the game with the fast rollout policy pπ, then computing the winner with function r.
d Action values Q are updated to track the mean value of all evaluations r(·) and vθ(·) in the subtree below that action.
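For reference, the quantities used in comments a–d are combined in [Sil+16] as follows (formulas reproduced from the paper):

a_t = argmax_a ( Q(s_t, a) + u(s_t, a) ),   where  u(s, a) ∝ P(s, a) / (1 + N(s, a))
V(s_L) = (1 − λ) · vθ(s_L) + λ · z_L        (leaf evaluation with mixing parameter λ)
N(s, a) = Σ_{i=1..n} 1(s, a, i),   Q(s, a) = (1 / N(s, a)) · Σ_{i=1..n} 1(s, a, i) · V(s_L^i)

where 1(s, a, i) indicates whether the edge (s, a) was traversed during the i-th simulation, and s_L^i is the leaf node of that simulation.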
AlphaGo, Conclusion?: This is definitely a great achievement!
• “Search + neural networks” looks like a great formula for general problem solving.
• expect to see lots of research on this in the coming decade(s).
• The AlphaGo design is quite intricate (architecture, learning workflow, training data design,
neural network architectures, . . . ).
The chess machine is an ideal one to start with, since (Claude Shannon (1949))
1. the problem is sharply defined both in allowed operations (the moves) and in the
ultimate goal (checkmate),
2. it is neither so simple as to be trivial nor too difficult for satisfactory solution,
3. chess is generally considered to require “thinking” for skilful play, [. . . ]
4. the discrete structure of chess fits well into the digital nature of modern comput-
ers.
Chess is the drosophila of Artificial Intelligence. (Alexander Kronrod (1965))
9.7 Conclusion
Summary
Games (2-player turn-taking zero-sum discrete and finite games) can be understood
as a simple extension of classical search problems.
Each player tries to reach a terminal state with the best possible utility (maximal
vs. minimal).
Minimax searches the game tree depth-first, max’ing and min’ing at the respective turns of each player. It yields perfect play, but takes time O(b^d), where b is the branching factor and d the search depth.
Except in trivial games (Tic-Tac-Toe), Minimax needs a depth limit and must apply an evaluation function to estimate the value of the cut-off states.
Alpha-beta search remembers the best values already achieved for each player elsewhere in the tree, and prunes sub-trees that cannot influence the result under optimal play.
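As a concrete illustration of these two points, here is a compact sketch of depth-limited minimax with alpha-beta pruning (not code from the lecture; the game interface actions, result, is_cutoff, eval_fn, and to_move is hypothetical):

import math

def alphabeta(state, depth, alpha=-math.inf, beta=math.inf):
    if is_cutoff(state, depth):
        return eval_fn(state)                 # evaluation function at cut-off states
    if to_move(state) == "Max":
        value = -math.inf
        for a in actions(state):
            value = max(value, alphabeta(result(state, a), depth - 1, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                 # Min already has a better alternative
                break                         # elsewhere: prune the remaining siblings
        return value
    value = math.inf
    for a in actions(state):
        value = min(value, alphabeta(result(state, a), depth - 1, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value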
Monte Carlo tree search (MCTS) samples game branches, and averages the findings.
AlphaGo controls this using neural networks: evaluation function (“value network”),
and action filter (“policy network”).
Suggested Reading:
In the last chapters we have studied methods for “general problem solving”, i.e. methods that are applicable to all problems that are expressible in terms of states and “actions”. It is crucial to realize that these states were atomic, which makes the algorithms employed (search algorithms) relatively simple and generic, but does not let them exploit any knowledge we might have about the internal structure of states.
In this chapter, we will look into algorithms that do just that by progressing to factored state representations. We will see that this allows for algorithms that are many orders of magnitude more efficient than search algorithms.
To give an intuition for factored state representations, we present some motivational examples in section 10.1 and go into the details of the Waltz algorithm, which gave rise to the main ideas of constraint satisfaction algorithms, in section 10.2. section 10.3 and section 10.4 define constraint satisfaction problems formally and use that to develop a class of backtracking/search based algorithms. The main contribution of the factored state representations is that we can formulate advanced search heuristics that guide search based on the structure of the states.
Definition 10.1.2.
A constraint satisfaction problem (CSP) is a search problem, where the states are given by a finite set V := {X1, . . ., Xn} of variables and domains {Dv | v∈V}, and the goal states are specified by a set of constraints specifying allowable combinations of values for subsets of variables.
Remark 10.1.5. We are now using a factored representation for world states.
Simple example of a formal representation language.
Allows useful general-purpose algorithms with more power than standard tree search algorithms.
[Figure: the principal states and territories of Australia and the corresponding map-coloring problem (cf. Figure 6.1 in [RN09]).]
Variables: WA, NT, Q, NSW, V, SA, T
Domains: Di = {red, green, blue}
Constraints: adjacent regions must have different colors, e.g. WA ≠ NT (if the language allows this), or
⟨WA, NT⟩ ∈ {⟨red, green⟩, ⟨red, blue⟩, ⟨green, red⟩, . . . }
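To make the example concrete, here is one possible plain-Python encoding of this CSP (an illustrative sketch, not a prescribed format):

VARIABLES = ["WA", "NT", "Q", "NSW", "V", "SA", "T"]
DOMAINS = {v: {"red", "green", "blue"} for v in VARIABLES}
# binary "different color" constraints between adjacent regions
NEQ = [("WA", "NT"), ("WA", "SA"), ("NT", "SA"), ("NT", "Q"), ("SA", "Q"),
       ("SA", "NSW"), ("SA", "V"), ("Q", "NSW"), ("NSW", "V")]

def consistent(assignment):
    """Check a (partial) assignment {variable: color} against all constraints."""
    return all(assignment[u] != assignment[v]
               for u, v in NEQ if u in assignment and v in assignment)

# e.g. consistent({"WA": "red", "NT": "green", "SA": "blue"})  ->  True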
If A = C: v_{A vs. B} + 1 ≠ v_{C vs. D}
(each team alternates between home matches and away matches).
Leading teams of last season meet
near the end of each half-season.
...
Estimated running time: End of this universe, and the next couple billion ones
after it . . .
Directly enumerate all permutations of the numbers 1, . . . , 306, test for each whether
it’s a legal Bundesliga schedule.
Estimated running time: Maybe only the time span of a few thousand uni-
verses.
View this as variables/constraints and use backtracking (this chapter)
Executed running time: About 1 minute.
How do they actually do it?: Modern computers and CSP methods: fractions
of a second. 19th (20th/21st?) century: Combinatorics and manual work.
Try it yourself with an off-the-shelf CSP solver, e.g. Minion [Min].
1. U.S. Major League Baseball, 30 teams, each 162 games. There’s one crucial additional difficulty,
in comparison to Bundesliga. Which one? Travel is a major issue here!! Hence “Traveling
Tournament Problem” in reference to the TSP.
2. This particular scheduling problem is called “car sequencing”, how to most efficiently get cars
through the available machines when making the final customer configuration (non-standard/flexible/custom
extras).
Simple methods for making backtracking aware of the structure of the problem, and thereby reducing search.
We will now have a detailed look at the problem (and innovative solution) that started the
field of constraint satisfaction problems.
Background:
Adolfo Guzman worked on an algorithm to count the number of simple objects (like children’s
blocks) in a line drawing. David Huffman formalized the problem and limited it to objects in
general position, such that the vertices are always adjacent to three faces and each vertex is
formed from three planes at right angles (trihedral). Furthermore, the drawings could only have
three kinds of lines: object boundary, concave, and convex. Huffman enumerated all possible
configurations of lines around a vertex. This problem was too narrow for real-world situations, so
Waltz generalized it to include cracks, shadows, non-trihedral vertices and light. This resulted in
over 50 different line labels and thousands of different junctions. [ILD]
Idea: Adjacent intersections impose constraints on each other. Use CSP to find a
unique set of labelings.
Observation 10.2.1. Each line in the images is one of the following:
a boundary line (edge of an object), labeled with an arrow, with the right hand side of the arrow denoting “solid”
and the left hand side denoting “space”
an interior convex edge (label with “+”)
an interior concave edge (label with “-”)
Fun Fact: CSP always works perfectly! (early success story for CSP [Wal75])
Waltz’s Examples
In his 1972 dissertation [Wal75], David Waltz used the following examples.
We will now work our way towards a definition of CSPs that is formal enough so that we can define the concept of a solution. This gives us the necessary grounding to talk about algorithms
later. Video Nuggets covering this section can be found at https://fau.tv/clip/id/22277
and https://fau.tv/clip/id/22279.
Types of CSPs
Definition 10.3.1. We call a CSP discrete, iff all of the variables have countable
domains; we have two kinds:
finite domains (size d; O(d^n) solutions)
e.g., Boolean CSPs (solvability ≙ Boolean satisfiability; NP-complete)
infinite domains (e.g. integers, strings, etc.)
e.g., job scheduling, variables are start/end days for each job
need a “constraint language”, e.g., StartJob1 + 5≤StartJob3
linear constraints decidable, nonlinear ones undecidable
Types of Constraints
We classify the constraints by the number of variables they involve.
Definition 10.3.6. Unary constraints involve a single variable, e.g., SA ̸= green.
Definition 10.3.7. Binary constraints involve pairs of variables, e.g., SA ̸= WA.
Definition 10.3.8. Higher-order constraints involve three or more variables, e.g., cryptarithmetic column constraints.
The number n of variables is called the order of the constraint.
Definition 10.3.9. Preferences (soft constraints), e.g. “red is better than green”, are often representable by a cost for each variable assignment; this leads to constrained optimization problems.
    S E N D
  + M O R E
  ---------
  M O N E Y

Column constraints (with carry variables X1, X2, X3):
D + E = Y + 10 · X1
X1 + N + R = E + 10 · X2
X2 + E + O = N + 10 · X3
X3 + S + M = O + 10 · M
Problem: The problem structure gets hidden. (search algorithms can get
confused)
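For illustration, the column constraints above can be checked mechanically; the following naive brute-force sketch (not one of the CSP algorithms of this chapter) enumerates injective digit assignments and the carries X1, X2, X3:

from itertools import permutations

def solve():
    for digits in permutations(range(10), 8):
        S, E, N, D, M, O, R, Y = digits
        if S == 0 or M == 0:                          # no leading zeros
            continue
        for X1 in (0, 1):
            for X2 in (0, 1):
                for X3 in (0, 1):
                    if (D + E == Y + 10 * X1 and
                            X1 + N + R == E + 10 * X2 and
                            X2 + E + O == N + 10 * X3 and
                            X3 + S + M == O + 10 * M):
                        return dict(zip("SENDMORY", digits))
    return None

print(solve())    # {'S': 9, 'E': 5, 'N': 6, 'D': 7, 'M': 1, 'O': 0, 'R': 8, 'Y': 2}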
Constraint Graph
Definition 10.3.13. A binary CSP is a CSP where each constraint is binary.
Observation 10.3.14. A binary CSP forms a graph called the constraint graph
whose nodes are variables, and whose edges represent the constraints.
Example 10.3.15. Australia as a binary CSP.
[Figure: the map of Australia and the corresponding constraint graph (cf. Figure 6.1 in [RN09]).]
Intuition: General-purpose CSP algorithms use the graph structure to speed up search. (E.g., Tasmania is an independent subproblem!)
Real-world CSPs
Example 10.3.16 (Assignment problems). e.g., who teaches what class
Example 10.3.17 (Timetabling problems). e.g., which class is offered when and
where?
Example 10.3.18 (Hardware configuration).
Example 10.3.19 (Spreadsheets).
Example 10.3.20 (Transportation scheduling).
Note that the ideas are still the same as Example 10.1.6, but in constraint networks
we have a language to formulate things precisely.
Idea: We will explore that idea for algorithms that solve constraint networks.
[Figure: two partial assignments, NT = green / Q = red and NT = green / Q = blue.]
Backtracking Search
Backtracking in Australia
Example 10.4.3. We apply backtracking search to the Australia map coloring problem. (The figures show steps 1–4 of the search.)
Example 10.4.5. In step 3 of Example 10.4.3, there is only one remaining value
for SA!
Example 10.4.7.
Where in Example 10.4.7 does the most constraining variable play a role in the choice? SA (the only possible choice), NT (all choices possible except WA, V, T). Where in the illustration does the most constrained variable play a role in the choice? NT (all choices possible except T), Q (only Q and WA possible).
By choosing the least constraining value first, we increase the chances of not ruling out the solutions below the current node.
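The following Python sketch (illustrative only) combines naive backtracking with the two ordering heuristics just discussed, for binary “not equal” CSPs such as map coloring; the input format (domains, neighbors) is an assumption made for this sketch:

def consistent(var, val, assignment, neighbors):
    """val for var does not clash with any already-assigned neighbor."""
    return all(assignment.get(u) != val for u in neighbors[var])

def backtrack(assignment, domains, neighbors):
    if len(assignment) == len(domains):
        return assignment                                   # all variables assigned
    unassigned = [v for v in domains if v not in assignment]
    # most constrained variable: fewest remaining consistent values
    var = min(unassigned, key=lambda v: sum(
        consistent(v, d, assignment, neighbors) for d in domains[v]))
    # least constraining value: eliminates the fewest options of the neighbors
    def ruled_out(val):
        return sum(1 for u in neighbors[var] if u not in assignment
                   and val in domains[u]
                   and consistent(u, val, assignment, neighbors))
    for val in sorted(domains[var], key=ruled_out):
        if consistent(var, val, assignment, neighbors):
            result = backtrack({**assignment, var: val}, domains, neighbors)
            if result is not None:
                return result
    return None                                             # no value works: backtrack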
Example 10.4.9.
Suggested Reading:
• Chapter 6: Constraint Satisfaction Problems, Sections 6.1 and 6.3, in [RN09].
– Compared to our treatment of the topic “Constraint Satisfaction Problems” (chapter 10 and
chapter 11), RN covers much more material, but less formally and in much less detail (in par-
ticular, my slides contain many additional in-depth examples). Nice background/additional
reading, can’t replace the lecture.
– Section 6.1: Similar to my “Introduction” and “Constraint Networks”, less/different examples,
much less detail, more discussion of extensions/variations.
– Section 6.3: Similar to my “Naïve Backtracking” and “Variable- and Value Ordering”, with
less examples and details; contains part of what I cover in chapter 11 (RN does inference first,
then backtracking). Additional discussion of backjumping.
Chapter 11
Constraint Propagation
In this chapter we discuss another idea that is central to symbolic AI as a whole. The first component is that with factored state representations, we need to use a representation language for (sets of) states. The second component is that instead of state-level search, we can graduate to representation-level search (inference), which can be much more efficient than state-level search, as the respective representation-language actions correspond to groups of state-level actions.
11.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/22321.
Illustration: Inference
Example 11.1.1.
A constraint network γ:
[Figure: the Australia map-coloring network (cf. Figure 6.1 in [RN09]).]
Example 11.1.2. C_WAQ := “=”. If WA and Q are assigned different colors, then NT must be assigned [. . . ]
Illustration: Decomposition
Example 11.1.3. A constraint network γ:
[Figure: the Australia map-coloring network, as above.]
11.2 Inference
A Video Nugget covering this section can be found at https://fau.tv/clip/id/22326.
Example 11.2.2. It’s what you do all the time when playing SuDoKu:
Example 11.2.4.
[Figure: two pairs of constraint networks γ and γ′ (domains {red, blue}, inequality constraints).]
Tightness
Definition 11.2.5 (Tightness). Let γ:=⟨V , D, C ⟩ and γ ′ = ⟨V γ ′ , Dγ ′ , C γ ′ ⟩ be
constraint networks sharing the same set of variables, then γ ′ is tighter than γ,
(write γ ′ ⊑γ), if:
(i) For all v∈V : Dv ⊆ Dv .
(ii) For all u ̸= v ∈ V and C uv ∈C γ ′ : either C uv ̸∈C or C uv ⊆ C uv .
γ ′ is strictly tighter than γ, (written γ ′ <γ), if at least one of these inclusions is
proper.
Example 11.2.6.
[Figure: four pairs of constraint networks γ and γ′ (domains {red, blue}, inequality constraints), illustrating (strict) tightness.]
Idea: Encode variable assignments as unary constraints (i.e., for a(v) = d, set the
unary constraint Dv = {d}), so that inference reasons about the network restricted
to the commitments already made.
[Table: domains of WA, NT, Q, NSW, V, SA, T during backtracking with forward checking.]
Note: It is a bit strange that we start with d′ here; this is to make the link to arc consistency – coming up next – as obvious as possible (same notation: u and d vs. v and d′).
Practical Properties:
Forward Checking
[Figure: forward checking illustrated on the network v1 < v2, v2 < v3 over domains {1, 2, 3}.]
When Forward Checking is Not Good Enough
Example 11.4.2.
[Table: domains of WA, NT, Q, NSW, V, SA, T during the search.]
Lemma 11.4.8. If d is the maximal domain size in γ and the test “(d, d′) ∈ Cuv?” has running time O(1), then the running time of Revise(γ, u, v) is O(d^2).
Example 11.4.9. Revise(γ, v3, v2)
[Figure: the network v1 < v2, v2 < v3; values of v3 without support in v2 are removed.]
Proof sketch: O(md^2) for each inner loop; a fixed point is reached at the latest once all nd variable values have been removed.
Problem: There are redundant computations.
Question: Do you see what these redundant computations are?
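A compact sketch of Revise and AC-3 in Python (illustrative; the representation of constraints as a dictionary of ordered pairs mapped to predicates is an assumption of this sketch, and both directions (u, v) and (v, u) of every binary constraint must be present):

from collections import deque

def revise(domains, constraints, u, v):
    """Remove values of u that have no supporting value in v; O(d^2) constraint tests."""
    test = constraints[(u, v)]
    supported = {d for d in domains[u] if any(test(d, e) for e in domains[v])}
    changed = supported != domains[u]
    domains[u] = supported
    return changed

def ac3(domains, constraints):
    queue = deque(constraints)                       # all ordered constraint pairs
    while queue:
        u, v = queue.popleft()
        if revise(domains, constraints, u, v):
            for (w, x) in constraints:               # D_u changed: re-check every
                if x == u and w != v:                # constraint pointing into u
                    queue.append((w, x))
    return domains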
AC-3: Example
Example 11.4.13. y div x = 0: y modulo x is 0, i.e., y is divisible by x
[Figure: AC-3 run on the network with constraints v2 div v1 = 0 and v3 div v1 = 0, showing the worklist M (pairs such as (v2, v1), (v1, v2), (v3, v1), (v1, v3)) and the domains shrinking after each Revise call.]
AC-3: Runtime
Problem structure
[Figure: the Australia constraint graph; Tasmania is an independent subproblem (cf. Figure 6.1 in [RN09]).]
E.g., n = 80, d = 2, c = 20:
2^80 ≈ 10^24 nodes: 4 billion years at 10 million nodes/sec.
4 · 2^20 ≈ 4 million nodes: 0.4 seconds at 10 million nodes/sec.
Example 11.5.4 (Doing the Numbers).
γ with n = 40 variables, each of domain size k = 2; four separate connected components, each of size 10.
Reduction of the worst case when using decomposition:
Tree-structured CSPs
Theorem 11.5.5. If the constraint graph has no cycles, the CSP can be solved in O(n·d^2) time.
Compare to general CSPs, where the worst-case time is O(d^n).
This property also applies to logical and probabilistic reasoning: an important example of the relation between syntactic restrictions and the complexity of reasoning.
Definition 11.5.8. Cutset conditioning: instantiate (in all ways) a set of variables such that the remaining constraint graph is a tree.
Cutset size c: running time O(d^c · (n − c)·d^2), very fast for small c.
Constraint networks with acyclic constraint graphs can be solved in (low order) polynomial time (PTIME).
Example 11.5.10. Australia is not acyclic. (But see next section)
[Figure: the Australia constraint graph (cf. Figure 6.1 in [RN09]).]
γ with n = 40 variables, each of domain size k = 2; acyclic constraint graph.
Reduction of the worst case when using decomposition:
No decomposition: 2^40. With decomposition: 40 · 2^2. Gain: 2^32.
a We assume here that γ’s constraint graph is connected. If it is not, do this and the following
AcyclicCG(γ): Example
Example 11.5.14 (AcyclicCG() execution).
[Figure: AcyclicCG() run on the network v1 < v2, v2 < v3 over domains {1, 2, 3}.]
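A sketch of the AcyclicCG idea in Python (illustrative; the interface order, parent, and ok is an assumption made for this sketch): make each parent arc consistent with its children bottom-up, then assign values top-down.

def solve_tree_csp(order, parent, domains, ok):
    # `order` lists the variables so that every variable's parent precedes it;
    # `parent` maps each variable to its parent (None for the root);
    # `ok(u, du, v, dv)` tests the binary constraint between u and v.
    # bottom-up pass: make each parent arc consistent with its child
    for v in reversed(order[1:]):
        p = parent[v]
        domains[p] = {dp for dp in domains[p]
                      if any(ok(p, dp, v, dv) for dv in domains[v])}
        if not domains[p]:
            return None                              # some domain ran empty: inconsistent
    # top-down pass: pick any value consistent with the parent's chosen value
    assignment = {}
    for v in order:
        p = parent[v]
        for dv in domains[v]:
            if p is None or ok(p, assignment[p], v, dv):
                assignment[v] = dv
                break
        else:
            return None          # does not happen after a successful bottom-up pass
    return assignment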
To apply to CSPs: allow states with unsatisfied constraints, actions reassign variable
values.
Variable selection: randomly select any conflicted variable.
Value selection: by the min conflicts heuristic: choose the value that violates the fewest constraints, i.e., hill climbing with h(n) := total number of violated constraints.
Example: 4-Queens
States: 4 queens in 4 columns (4^4 = 256 states)
Actions: move queen in column
Performance of min-conflicts
Given random initial state, can solve n-queens in almost constant time for arbitrary
n with high probability (e.g., n = 10,000,000)
The same appears to be true for any randomly-generated CSP except in a narrow range of the ratio
R = (number of constraints) / (number of variables)
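A minimal Python sketch of min-conflicts for n-queens (illustrative only; queens[c] is the row of the queen in column c, and the step bound is an arbitrary choice for this sketch):

import random

def conflicts(queens, col, row):
    """Number of queens attacking a queen placed at (col, row)."""
    return sum(1 for c, r in enumerate(queens)
               if c != col and (r == row or abs(r - row) == abs(c - col)))

def min_conflicts(n, max_steps=100_000):
    queens = [random.randrange(n) for _ in range(n)]         # random initial state
    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(queens, c, queens[c]) > 0]
        if not conflicted:
            return queens                                     # no violated constraints
        col = random.choice(conflicted)                       # random conflicted variable
        queens[col] = min(range(n),                           # value with fewest conflicts
                          key=lambda r: conflicts(queens, col, r))
    return None

# print(min_conflicts(8))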
Arc consistency removes values that do not comply with any value still available at
the other end of a constraint. This subsumes forward checking.
The constraint graph captures the dependencies between variables. Separate con-
nected components can be solved independently. Networks with acyclic constraint
graphs can be solved in low order polynomial time.
A cutset is a subset of variables whose removal renders the constraint graph acyclic.
Cutset decomposition backtracks only on such a cutset, and solves a sub-problem
with acyclic constraint graph at each search leaf.
Suggested Reading:
• Chapter 6: Constraint Satisfaction Problems in [RN09], in particular Sections 6.2, 6.3.2, and
6.5.
– Compared to our treatment of the topic “Constraint Satisfaction Problems” (chapter 10 and
chapter 11), RN covers much more material, but less formally and in much less detail (in par-
ticular, my slides contain many additional in-depth examples). Nice background/additional
reading, can’t replace the lecture.
– Section 6.3.2: Somewhat comparable to my “Inference” (except that equivalence and tightness
are not made explicit in RN) together with “Forward Checking”.
– Section 6.2: Similar to my “Arc Consistency”, less/different examples, much less detail, addi-
tional discussion of path consistency and global constraints.
– Section 6.5: Similar to my “Decomposition: Constraint Graphs, and Three Simple Cases” and
“Cutset Conditioning”, less/different examples, much less detail, additional discussion of tree
decomposition.
Part III
12.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/22455.
Definition 12.1.2 (Actions). The agent can perform the following actions: goForward, turnRight (by 90°), turnLeft (by 90°), shoot the arrow in the direction you are facing (you have exactly one arrow), grab an object in the current cell, leave the cave if you are in cell [1, 1].
Definition 12.1.3 (Initial and Terminal States). Initially, the agent is in cell [1, 1] facing east. If the agent falls down a pit or meets a live Wumpus, it dies.
Definition 12.1.4 (Percepts). The agent can experience the following percepts:
stench, breeze, glitter, bump, scream, none.
Cell adjacent (i.e. north, south, west, east) to Wumpus: stench (else: none).
Cell adjacent to pit: breeze (else: none).
Cell that contains gold: glitter (else: none).
You walk into a wall: bump (else: none).
Wumpus shot by arrow: scream (else: none).
(1) Initial state (2) One step to right (3) Back, and up to [1,2]
[Figure 2.12 in [RN09]: a model-based reflex agent — it keeps track of the current state of the world using an internal model, and then chooses an action in the same way as the reflex agent.]
Idea: “Thinking” = Reasoning about knowledge represented using logic.
chapter 15: The Davis Putnam procedure and clause learning; practical problem
structure.
State-of-the-art algorithms for reasoning about propositional logic, and an im-
portant observation about how they behave.
Definition 12.2.1 (Syntax). The formulae of propositional logic (write PL0 ) are
made up from
propositional variables: V0 :={P , Q, R, P 1 , P 2 , . . .} (countably infinite)
constants/constructors called connectives: Σ0 :={T , F , ¬, ∨, ∧, ⇒, ⇔, . . .}
Definition 12.2.6. The value function I φ : wff0 (V0 )→Do assigns values to PL0
formulae. It is recursively defined,
I φ (P ) = φ(P ) (base case)
I φ (¬A) = I(¬)(I φ (A)).
I φ (A ∧ B) = I(∧)(I φ (A), I φ (B)).
Computing Semantics
Example 12.2.8. Let φ:=[T/P 1 ], [F/P 2 ], [T/P 3 ], [F/P 4 ], . . . then
I φ (P 1 ∨ P 2 ∨ ¬(¬P 1 ∧ P 2 ) ∨ P 3 ∧ P 4 )
= I(∨)(I φ (P 1 ∨ P 2 ), I φ (¬(¬P 1 ∧ P 2 ) ∨ P 3 ∧ P 4 ))
= I(∨)(I(∨)(I φ (P 1 ), I φ (P 2 )), I(∨)(I φ (¬(¬P 1 ∧ P 2 )), I φ (P 3 ∧ P 4 )))
= I(∨)(I(∨)(φ(P 1 ), φ(P 2 )), I(∨)(I(¬)(I φ (¬P 1 ∧ P 2 )), I(∧)(I φ (P 3 ), I φ (P 4 ))))
= I(∨)(I(∨)(T, F), I(∨)(I(¬)(I(∧)(I φ (¬P 1 ), I φ (P 2 ))), I(∧)(φ(P 3 ), φ(P 4 ))))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(I φ (P 1 )), φ(P 2 ))), I(∧)(T, F)))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(φ(P 1 )), F)), F))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(T), F)), F))
= I(∨)(T, I(∨)(I(¬)(I(∧)(F, F)), F))
= I(∨)(T, I(∨)(I(¬)(F), F))
= I(∨)(T, I(∨)(T, F))
= I(∨)(T, T)
= T
What a mess!
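The same computation can be done mechanically; here is a small Python sketch of the value function on formulae represented as nested tuples (the representation is an assumption made for this sketch, not a notation from the notes):

def value(formula, phi):
    """I_phi: evaluate a formula under the variable assignment phi (dict var -> bool)."""
    op = formula[0]
    if op == "var":
        return phi[formula[1]]                       # base case: I_phi(P) = phi(P)
    if op == "not":
        return not value(formula[1], phi)
    if op == "and":
        return value(formula[1], phi) and value(formula[2], phi)
    if op == "or":
        return value(formula[1], phi) or value(formula[2], phi)
    raise ValueError("unknown connective: " + op)

# the formula and assignment from Example 12.2.8:
phi = {"P1": True, "P2": False, "P3": True, "P4": False}
P1, P2, P3, P4 = (("var", p) for p in ("P1", "P2", "P3", "P4"))
formula = ("or", ("or", P1, P2),
                 ("or", ("not", ("and", ("not", P1), P2)),
                        ("and", P3, P4)))
print(value(formula, phi))                            # True, as computed above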
Now we will also review some propositional identities that will be useful later on. Some of them we
have already seen, and some are new. All of them can be proven by simple truth table arguments.
Propositional Identities
We have the following identities in propositional logic:
We will now use the distribution of values of a Boolean expression under all (variable) assignments
to characterize them semantically. The intuition here is that we want to understand theorems,
examples, counterexamples, and inconsistencies in mathematics and everyday reasoning1 .
The idea is to use the formal language of Boolean expressions as a model for mathematical
language. Of course, we cannot express all of mathematics as Boolean expressions, but we can at
least study the interplay of mathematical statements (which can be true or false) with the copula
“and”, “or” and “not”.
Let us now see how these semantic properties model mathematical practice.
In mathematics we are interested in assertions that are true in all circumstances. In our model
of mathematics, we use variable assignments to stand for circumstances. So we are interested
in Boolean expressions which are true under all variable assignments; we call them valid. We
often give examples (or show situations) which make a conjectured assertion false; we call such
examples counterexamples, and such assertions “falsifiable”. We also often give examples for certain
assertions to show that they can indeed be made true (which is not the same as being valid
yet); such assertions we call “satisfiable”. Finally, if an assertion cannot be made true in any
circumstances we call it “unsatisfiable”; such assertions naturally arise in mathematical practice in
the form of refutation proofs, where we show that an assertion (usually the negation of the theorem
we want to prove) leads to an obviously unsatisfiable conclusion, showing that the negation of the
theorem is unsatisfiable, and thus the theorem valid.
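These four notions can be made executable by checking a formula under all variable assignments; the following self-contained Python sketch (same tuple representation as in the evaluator sketch above, chosen only for illustration) does exactly that:

from itertools import product

def value(f, phi):
    op = f[0]
    if op == "var": return phi[f[1]]
    if op == "not": return not value(f[1], phi)
    if op == "and": return value(f[1], phi) and value(f[2], phi)
    if op == "or":  return value(f[1], phi) or value(f[2], phi)
    raise ValueError(op)

def variables(f):
    return {f[1]} if f[0] == "var" else set().union(*(variables(s) for s in f[1:]))

def truth_values(f):
    vs = sorted(variables(f))
    return [value(f, dict(zip(vs, bits)))
            for bits in product([True, False], repeat=len(vs))]

def valid(f):         return all(truth_values(f))
def satisfiable(f):   return any(truth_values(f))
def falsifiable(f):   return not valid(f)
def unsatisfiable(f): return not satisfiable(f)

# e.g. P ∨ ¬P is valid, P ∧ ¬P is unsatisfiable:
P = ("var", "P")
print(valid(("or", P, ("not", P))), unsatisfiable(("and", P, ("not", P))))   # True True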
1 Here (and elsewhere) we will use mathematics (and the language of mathematics) as a test tube for under-
standing reasoning, since mathematics has a long history of studying its own reasoning processes and assumptions.
3. 1. together with 2.1 entails that ai(x) ⇒ bla(x) for every x∈{S, N , J},
4. thus ¬bla(S) ∧ ¬bla(J) by 3. and 2.2 and
5. so ¬ai(S) ∧ ¬ai(J) by 3. and 4.
6. With 2.3 the latter entails ai(N ).
In the hair-color example we have seen that we are able to model complex situations in PL0 .
The trick of using variables with fancy names like bla(N ) is a bit dubious, and we can already
imagine that it will be difficult to support programmatically unless we make names like bla(N) into first-class citizens, i.e. expressions of the logic language themselves.
Idea: Re-use PL0 , but replace propositional variables with something more expres-
sive! (instead of fancy variable name
trick)
Definition 12.3.1. A first-order signature consists of pairwise disjoint, countable sets for each k∈N:
function symbols: f^k ∈ Σf_k
predicate symbols: p^k ∈ Σp_k
Definition 12.3.2. The formulae of PLnq are given by the following grammar:
terms      t ::= X                     variable
              |  f^0                   constant
              |  f^k(t1, . . ., tk)    application
formulae   A ::= p^k(t1, . . ., tk)    atomic
              |  ¬A                    negation
              |  A1 ∧ A2               conjunction
PLNQ Semantics
Definition 12.3.3. Universes Do = {T, F} of truth values and Dι ̸= ∅ of individu-
als.
Definition 12.3.5.
Corollary 12.3.12. PLnq is isomorphic to PL0 , i.e. the following diagram commutes:
[Diagram: PLnq(Σ1) is translated to PL0(AΣ) via θΣ; on the semantic side, models ψ = ⟨Dψ, Iψ⟩ are mapped to variable assignments VΣ → {T, F} via ψ ↦ Mψ, and the value functions Iψ(·) and IφMψ(·) agree.]
3.2. If A = ¬B, then Iψ(A) = T, iff Iψ(B) = F, iff IφMψ(B) = F, iff IφMψ(A) = T.
3.3. If A = B ∧ C, then we argue similarly.
4. Hence Iψ(A) = IφMψ(A) for all PLnq formulae, and we have concluded the proof.
We have now defined syntax (the language agents can use to represent knowledge) and its semantics (how expressions of this language relate to the agent’s environment). Theoretically, an agent could use the entailment relation to derive new knowledge from percepts and the existing state representation – in the MAKE−PERCEPT−SENTENCE and MAKE−ACTION−SENTENCE subroutines below. But as we have seen above, this is very tedious. A much better way would be to have a set of rules that directly act on the state representations.
[Figure 2.12 in [RN09]: a model-based reflex agent (as above).]
Idea: “Thinking” = Reasoning about knowledge represented using logic.
A Simple Formal System: Prop. Logic with Hilbert-Calculus
Formulae: built from propositional variables: P, Q, R, . . . and implication: ⇒
Semantics: Iφ(P) = φ(P) and Iφ(A ⇒ B) = T, iff Iφ(A) = F or Iφ(B) = T.
Definition 12.4.1. The Hilbert calculus H0 consists of the inference rules:

  ------------ K      --------------------------------------- S
   P ⇒ Q ⇒ P          (P ⇒ Q ⇒ R) ⇒ (P ⇒ Q) ⇒ P ⇒ R

   A ⇒ B    A              A
  ------------ MP      ----------- Subst
       B                [B/X](A)
This is indeed a very simple formal system, but it has all the required parts:
• A formal language: expressions built up from variables and implications.
• A semantics: given by the obvious interpretation function
• A calculus: given by the two axioms and the two inference rules.
The calculus gives us a set of rules with which we can derive new formulae from old ones. The
axioms are very simple rules, they allow us to derive these two formulae in any situation. The
proper inference rules are slightly more complicated: we read the formulae above the horizontal
line as assumptions and the (single) formula below as the conclusion. An inference rule allows us
to derive the conclusion, if we have already derived the assumptions.
Now, we can use these inference rules to perform a proof – a sequence of formulae that can be
derived from each other. The representation of the proof in the slide is slightly compactified to fit
onto the slide: We will make it more explicit here. We first start out by deriving the formula
(P ⇒ Q ⇒ R) ⇒ (P ⇒ Q) ⇒ P ⇒ R (12.1)
which we can always do, since we have an axiom for this formula, then we apply the rule Subst,
where A is this result, B is C, and X is the variable P to obtain
(C ⇒ Q ⇒ R) ⇒ (C ⇒ Q) ⇒ C ⇒ R (12.2)
Next we apply the rule Subst to this where B is C ⇒ C and X is the variable Q this time to obtain
(C ⇒ (C ⇒ C) ⇒ R) ⇒ (C ⇒ C ⇒ C) ⇒ C ⇒ R (12.3)
And again, we apply the inference rule Subst; this time B is C and X is the variable R, yielding
the first formula in our proof on the slide. To conserve space, we have combined these three steps
into one in the slide. The next steps are done in exactly the same way.
In general formulae can be used to represent facts about the world as propositions; they have a
semantics that is a mapping of formulae into the real world (propositions are mapped to truth
values.) We have seen two relations on formulae: the entailment relation and the deduction
relation. The first one is defined purely in terms of the semantics, the second one is given by a
calculus, i.e. purely syntactically. Is there any relation between these relations?
Goal: Find calculi C, such that ⊢C A iff |=A (provability and validity coincide)
To TRUTH through PROOF (CALCULEMUS [Leibniz ∼1680])
Ideally, both relations would be the same, then the calculus would allow us to infer all facts
that can be represented in the given formal language and that are true in the real world, and only
those. In other words, our representation and inference is faithful to the world.
A consequence of this is that we can rely on purely syntactical means to make predictions
about the world. Computers rely on formal representations of the world; if we want to solve a
problem on our computer, we first represent it in the computer (as data structures, which can be
seen as a formal language) and do syntactic manipulations on these structures (a form of calculus).
Now, if the provability relation induced by the calculus and the validity relation coincide (this will
be quite difficult to establish in general), then the solutions of the program will be correct, and
we will find all possible ones.
Of course, the logics we have studied so far are very simple, and not able to express interesting
facts about the world, but we will study them as a simple example of the fundamental problem of
computer science: How do the formal representations correlate with the real world. Within the
world of logics, one can derive new propositions (the conclusions, here: Socrates is mortal) from given ones (the premises, here: Every human is mortal and Socrates is human). Such derivations
are proofs.
In particular, logics can describe the internal structure of real-life facts; e.g. individual things,
actions, properties. A famous example, which is in fact as old as it appears, is illustrated in the
slide below.
If a logic is correct, the conclusions one can prove are true (= hold in the real world) whenever
the premises are true. This is a miraculous fact (think about it!)
for propositional logic. The calculus was created in order to model the natural mode of reasoning
e.g. in everyday mathematical practice. In particular, it was intended as a counter-approach to
the well-known Hilbert style calculi, which were mainly used as theoretical devices for studying
reasoning in principle, not for modeling particular reasoning styles. We will introduce natural
deduction in two styles/notations, both invented by Gerhard Gentzen in the 1930s and very much related. The Natural Deduction style (ND) uses “local hypotheses” in proofs for hypothetical reasoning, while the “sequent style” is a rationalized version and extension of the ND calculus that makes certain meta-proofs simpler to push through by making the context of local hypotheses explicit in the notation. The sequent notation also constitutes a more adequate data structure for implementations and user interfaces.
Rather than using a minimal set of inference rules, we introduce a natural deduction calculus that
provides two/three inference rules for every logical constant, one “introduction rule” (an inference
rule that derives a formula with that symbol at the head) and one “elimination rule” (an inference
rule that acts on a formula with this head and derives a set of subformulae).
    [A]^1
     ...
      B                 A ⇒ B    A
  --------- ⇒I^1       ------------- ⇒E
   A ⇒ B                    B
The most characteristic rule in the natural deduction calculus is the ⇒I rule and the hypothetical reasoning it introduces. ⇒I corresponds to the mathematical way of proving an implication A⇒B:
We assume that A is true and show B from this local hypothesis. When we can do this we discharge
the assumption and conclude A ⇒ B.
Note that the local hypothesis is discharged by the rule ⇒I, i.e. it cannot be used in any other
part of the proof. As the ⇒I rules may be nested, we decorate both the rule and the corresponding
assumption with a marker (here the number 1).
Let us now consider an example of hypothetical reasoning in action.
  [A ∧ B]^1          [A ∧ B]^1
  ---------- ∧Er     ---------- ∧El
      B                  A
  ---------------------------- ∧I
            B ∧ A
  ---------------------------- ⇒I^1
        A ∧ B ⇒ B ∧ A

     [A]^1
    ------- ⇒I^2      (the local hypothesis [B]^2 is not used)
     B ⇒ A
  ------------ ⇒I^1
   A ⇒ B ⇒ A
Here we see hypothetical reasoning with local hypotheses at work. In the left example, we assume the formula A ∧ B and can use it in the proof until it is discharged by the rule ⇒I^1 at the bottom – therefore we decorate the hypothesis and the rule with corresponding numbers (here the label “1”). Note that the assumption A ∧ B is local to the proof fragment delineated by the corresponding local hypothesis and the discharging rule, i.e. even if this proof is only a fragment of a larger proof, then we cannot use its local hypothesis anywhere else.
Note also that we can use as many copies of the local hypothesis as we need; they are all
discharged at the same time.
In the right example we see that local hypotheses can be nested as long as they are kept local. In
particular, we may not use the hypothesis B after the ⇒I 2 , e.g. to continue with a ⇒E.
One of the nice things about the natural deduction calculus is that the deduction theorem is
almost trivial to prove. In a sense, the triviality of the deduction theorem is the central idea of
the calculus and the feature that makes it so natural.
Another characteristic of the natural deduction calculus is that it has inference rules (introduction
and elimination rules) for all connectives. So we extend the set of rules from Definition 12.5.1 for
disjunction, negation and falsity.
      A                  B             A ∨ B    [A]^1 . . . C    [B]^1 . . . C
  --------- ∨Il      --------- ∨Ir    -------------------------------------- ∨E^1
    A ∨ B              A ∨ B                             C

   [A]^1 . . . C    [A]^1 . . . ¬C           ¬¬A
  ---------------------------------- ¬I^1   ------ ¬E
                 ¬A                            A

    ¬A    A            F
  ----------- FI     ------ FE
       F               A
  -------------- Ax     -------------- Ax
  A ∧ B ⊢ A ∧ B         A ∧ B ⊢ A ∧ B
  -------------- ∧Er    -------------- ∧El
  A ∧ B ⊢ B             A ∧ B ⊢ A
  ------------------------------------ ∧I
            A ∧ B ⊢ B ∧ A
  ------------------------------------ ⇒I
            ⊢ A ∧ B ⇒ B ∧ A

  ----------- Ax
   A, B ⊢ A
  -------------- ⇒I
   A ⊢ B ⇒ A
  ----------------- ⇒I
   ⊢ A ⇒ B ⇒ A
Definition 12.5.9. The following inference rules make up the propositional sequent-style natural deduction calculus ND⊢0:

  ----------- Ax        Γ ⊢ B
   Γ, A ⊢ A          ----------- weaken      ---------------- TND
                       Γ, A ⊢ B               Γ ⊢ A ∨ ¬A

    Γ, A ⊢ F              Γ ⊢ ¬¬A
  ------------ ¬I       ----------- ¬E
    Γ ⊢ ¬A                 Γ ⊢ A
Each row in the table represents one inference step in the proof. It consists of a line number (for referencing), a formula for the asserted property, a justification via an ND rule (and the rows this one is derived from), and finally a list of row numbers of proof steps that are local hypotheses in effect for the current row.
12.6 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25027.
Summary
Sometimes, it pays off to think before acting.
Propositional logic formulas are built from atomic propositions, with the connectives
and, or, not.
Suggested Reading:
• Chapter 7: Logical Agents, Sections 7.1 – 7.5 [RN09].
– Sections 7.1 and 7.2 roughly correspond to my “Introduction”, Section 7.3 roughly corresponds
to my “Logic (in AI)”, Section 7.4 roughly corresponds to my “Propositional Logic”, Section
7.5 roughly corresponds to my “Resolution” and “Killing a Wumpus”.
– Overall, the content is quite similar. I have tried to add some additional clarifying illustrations.
RN gives many complementary explanations, nice as additional background reading.
– I would note that RN’s presentation of resolution seems a bit awkward, and Section 7.5 con-
tains some additional material that is imho not interesting (alternate inference rules, forward
and backward chaining). Horn clauses and unit resolution (also in Section 7.5), on the other
hand, are quite relevant.
Chapter 13
Recall: Our knowledge of the cave entails a definite Wumpus position!(slide 306)
Problem: That was human reasoning, can we build an agent function that does
this?
Answer:
As for constraint networks, we use inference, here resolution/tableaux.
Unsatisfiability Theorem
Theorem 13.0.1 (Unsatisfiability Theorem). H |= A iff H ∪ {¬A} is unsatisfi-
able.
Proof: We prove both directions separately
1. “⇒”: Say H |= A
1.1. For any φ with φ|=H we have φ|=A and thus φ̸|=¬A.
2. “⇐”: Say H ∪ {¬A} is unsatisfiable.
2.1. For any φ with φ|=H we have φ̸|=¬A and thus φ|=A.
Observation 13.0.2. Entailment can be tested via satisfiability.
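A small worked instance of Theorem 13.0.1 (a standard example, not taken from the notes): take H := {P, P ⇒ Q} and A := Q. Then H |= Q, and correspondingly H ∪ {¬Q} = {P, P ⇒ Q, ¬Q} is unsatisfiable: any φ satisfying P and P ⇒ Q must satisfy Q, which contradicts ¬Q. A theorem prover can therefore establish H |= Q by refuting H ∪ {¬Q}.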
Definition 13.0.3. Given a formal system ⟨L, K, |=, C ⟩, the task of theorem proving
consists in determining whether H⊢C C for a conjecture C∈L and hypotheses H ⊆
L.
Definition 13.0.4. Given a logical system L:=⟨L, K, |=⟩, the task of automated
theorem proving (ATP) consists of developing calculi for L and programs – called
(automated) theorem provers – that given a set H ⊆ L of hypotheses and a conjec-
ture A∈L determine whether H |= A (usually by searching for C-derivations H⊢C A
in a calculus C).
Idea: ATP with a calculus C for ⟨L, K, |=⟩ induces a search problem Π := ⟨S, A, T, I, G⟩, where the states S are sets of formulae in L, the actions A are the inference rules from C, the initial state is I = H, and the goal states are those S∈S with A∈S.
Problem: ATP as a search problem does not admit good heuristics, since these
need to take the conjecture A into account.
Idea: Turn the search around – using the unsatisfiability theorem (Theorem 13.0.1).
Observation: A test calculus C induces a search problem where the initial state
is H ∪ {¬A} and S∈S is a goal state iff ⊥∈S.(proximity of ⊥ easier for heuristics)
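To make the refutation idea concrete, here is a minimal Python sketch (my illustration, not part of the course materials): it decides H |= A by checking that H ∪ {¬A} is unsatisfiable, with brute-force enumeration of variable assignments standing in for a proper test calculus. The tuple encoding of formulae is an ad-hoc choice for this sketch.

from itertools import product

# Formulae as nested tuples: ("var", "P"), ("not", A), ("and", A, B), ("or", A, B).
def atoms(f):
    if f[0] == "var":
        return {f[1]}
    return set().union(*(atoms(g) for g in f[1:]))

def value(f, phi):
    op = f[0]
    if op == "var":  return phi[f[1]]
    if op == "not":  return not value(f[1], phi)
    if op == "and":  return value(f[1], phi) and value(f[2], phi)
    if op == "or":   return value(f[1], phi) or value(f[2], phi)
    raise ValueError(op)

def unsatisfiable(formulas):
    """True iff no variable assignment satisfies all formulas at once."""
    vs = sorted(set().union(*(atoms(f) for f in formulas)))
    return not any(all(value(f, dict(zip(vs, bits))) for f in formulas)
                   for bits in product([True, False], repeat=len(vs)))

def entails(H, A):
    """H |= A iff H together with the negation of A is unsatisfiable."""
    return unsatisfiable(list(H) + [("not", A)])

# {P, ~P v Q} |= Q:
P, Q = ("var", "P"), ("var", "Q")
print(entails([P, ("or", ("not", P), Q)], Q))   # True

The test calculi discussed below (tableaux and resolution) replace the exponential enumeration in unsatisfiable by syntactic derivations that search for ⊥ or the empty clause.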
The idea about literals is that they are atoms (the simplest formulae) that carry around their
intended truth value.
13.2. ANALYTICAL TABLEAUX 219
Normal Forms
There are two quintessential normal forms for propositional formulae: (there are others as well)
Definition 13.1.6. A formula is in conjunctive normal form (CNF) if it is a conjunction of disjunctions of literals, i.e. if it is of the form ⋀_{i=1}^{n} ⋁_{j=1}^{m_i} l_{i,j}.
Definition 13.1.7. A formula is in disjunctive normal form (DNF) if it is a disjunction of conjunctions of literals, i.e. if it is of the form ⋁_{i=1}^{n} ⋀_{j=1}^{m_i} l_{i,j}.
Observation 13.1.8. Every formula has equivalent formulae in CNF and DNF.
Tableau calculi develop a formula in a tree-shaped arrangement that represents a case analysis on
when a formula can be made true (or false). Therefore the formulae are decorated with exponents
that hold the intended truth value.
On the left we have a refutation tableau that analyzes a negated formula (it is decorated with the
intended truth value F). Both branches contain an elementary contradiction ⊥.
On the right we have a model generation tableau, which analyzes a positive formula (it is decorated with the intended truth value T). This tableau uses the same rules as the refutation tableau, but makes a case analysis of when this formula can be satisfied. In this case we have a closed branch and an open one; the open branch corresponds to a model.
Now that we have seen the examples, we can write down the tableau rules formally.
The propositional tableau rules:
T0∧: from (A ∧ B)^T derive A^T and B^T (on the same branch)
T0∨: from (A ∧ B)^F split the branch into A^F and B^F
T0¬T: from (¬A)^T derive A^F        T0¬F: from (¬A)^F derive A^T
T0⊥: from A^α and A^β with α ≠ β derive ⊥
Definition 13.2.4. Call a tableau saturated, iff no rule application adds new material to it, and call a branch closed, iff it ends in ⊥, else open. A tableau is closed, iff all of its branches are.
These inference rules, which act on tableaux, have to be read as follows: if the formulae above the line appear on a tableau branch, then the branch can be extended by the formulae or branches below the line. There are two rules for each primary connective, and a branch closing rule that adds the special symbol ⊥ (for unsatisfiability) to a branch.
We use the tableau rules with the convention that they are only applied, if they contribute new
material to the branch. This ensures termination of the tableau procedure for propositional logic
(every rule eliminates one primary connective).
Definition 13.2.5. We will call a closed tableau with the labeled formula Aα at the root a
tableau refutation for Aα .
The saturated tableau represents a full case analysis of what is necessary to give A the truth value
α; since all branches are closed (contain contradictions) this is impossible.
Definition 13.2.7. We will call a tableau refutation for AF a tableau proof for A, since it refutes
the possibility of finding a model where A evaluates to F. Thus A must evaluate to T in all
models, which is just our definition of validity.
Thus the tableau procedure can be used as a calculus for propositional logic. In contrast to the
propositional Hilbert calculus it does not prove a theorem A by deriving it from a set of axioms,
but it proves it by refuting its negation. Such calculi are called negative or test calculi. Generally
negative calculi have computational advantages over positive ones, since they have a built-in sense
of direction.
We have rules for all the necessary connectives (we restrict ourselves to ∧ and ¬, since the others
can be expressed in terms of these two via the propositional identities above. For instance, we can
write A ∨ B as ¬(¬A ∧ ¬B), and A ⇒ B as ¬A ∨ B,. . . .)
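The T0 rules can be turned directly into a small proof procedure. The following Python sketch is my own illustration (not the formalism of the notes): it works on labeled formulae over the ∧/¬ fragment, encoded as nested tuples, and tableau_proof(A) tests validity of A by refuting A^F.

# Formulae: ("var", name), ("not", A), ("and", A, B); labels are "T" or "F".
def closed(branch, todo):
    """Expand the labeled formulae in todo on this branch; True iff every resulting branch closes."""
    if not todo:                                   # saturated: closed iff it contains A^T and A^F
        return any((f, "T") in branch and (f, "F") in branch for (f, _) in branch)
    (f, a), rest = todo[0], todo[1:]
    branch = branch + [(f, a)]
    if (f, "F" if a == "T" else "T") in branch:    # branch closing rule
        return True
    if f[0] == "var":
        return closed(branch, rest)
    if f[0] == "not":                              # T0 negation rules: flip the label
        return closed(branch, rest + [(f[1], "F" if a == "T" else "T")])
    if f[0] == "and" and a == "T":                 # T0 conjunction-true: both conjuncts
        return closed(branch, rest + [(f[1], "T"), (f[2], "T")])
    if f[0] == "and" and a == "F":                 # T0 conjunction-false: split the branch
        return (closed(branch, rest + [(f[1], "F")]) and
                closed(branch, rest + [(f[2], "F")]))
    raise ValueError(f[0])

def tableau_proof(A):
    """A is valid iff the tableau with root A^F closes."""
    return closed([], [(A, "F")])

P = ("var", "P")
print(tableau_proof(("not", ("and", P, ("not", P)))))   # True: ¬(P ∧ ¬P) is valid

The convention that rules are only applied when they contribute new material is mirrored here by the fact that each step consumes one labeled formula and only adds strictly smaller ones, so the recursion terminates.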
We now look at a formulation of propositional logic with fancy variable names. Note that
loves(mary, bill) is just a variable name like P or X, which we have used earlier.
Example 13.2.8. The formula "If Mary loves Bill and John loves Mary, then John loves Mary" is valid; the tableau for its negation closes:
(loves(mary, bill) ∧ loves(john, mary) ⇒ loves(john, mary))^F
¬(¬¬(loves(mary, bill) ∧ loves(john, mary)) ∧ ¬loves(john, mary))^F
(¬¬(loves(mary, bill) ∧ loves(john, mary)) ∧ ¬loves(john, mary))^T
¬¬(loves(mary, bill) ∧ loves(john, mary))^T
¬(loves(mary, bill) ∧ loves(john, mary))^F
(loves(mary, bill) ∧ loves(john, mary))^T
¬loves(john, mary)^T
loves(mary, bill)^T
loves(john, mary)^T
loves(john, mary)^F
⊥
We could have used the unsatisfiability theorem (Theorem 13.0.1) here to show that If Mary loves
Bill and John loves Mary entails John loves Mary. But there is a better way to show entailment:
we directly use derivability in T0 .
Deriving Entailment in T0
Example 13.2.9. Mary loves Bill and John loves Mary together entail that John
loves Mary
loves(mary, bill)^T
loves(john, mary)^T
loves(john, mary)^F
⊥
This is a closed tableau, so {loves(mary, bill), loves(john, mary)}⊢T0 loves(john, mary).
Again, as T0 is sound and complete, we have {loves(mary, bill), loves(john, mary)} |= loves(john, mary).
Note: We can also use the tableau calculus to try and show entailment (and fail). The nice thing is that from the failed proof we can see what went wrong.
Obviously, the tableau above is saturated, but not closed, so it is not a tableau proof for our initial entailment conjecture. We have marked the literals on the open branch in green, since they allow us to read off the conditions of the situation in which the entailment fails to hold. As we intuitively argued above, this is the situation where Mary loves Bill. In particular, the open branch gives us a variable assignment (marked in green) that satisfies the initial formula. In this case, Mary loves Bill, which is a situation where the entailment fails.
Again, the derivability version is much simpler:
We have seen in the examples above that while it is possible to get by with only the connectives ∧ and ¬, it is a bit unnatural and tedious, since we need to eliminate the other connectives first. In this chapter, we will make the calculus less frugal by adding rules for the other connectives, without losing the advantage of dealing with a small calculus, which is good for making statements about the calculus itself.
We will convince ourselves that the first rule is derivable, and leave the other ones as an exercise.
The derived rules for the defined connectives:
(A ⇒ B)^T: split the branch into A^F and B^T        (A ⇒ B)^F: derive A^T and B^F
(A ∨ B)^T: split the branch into A^T and B^T        (A ∨ B)^F: derive A^F and B^F
(A ⇔ B)^T: split into A^T, B^T and A^F, B^F         (A ⇔ B)^F: split into A^T, B^F and A^F, B^T
The first of them is justified by the following derivation, which expands A ⇒ B via its definition ¬(¬¬A ∧ ¬B) and applies the T0 rules:
A^T
(A ⇒ B)^T
(¬A ∨ B)^T
¬(¬¬A ∧ ¬B)^T
(¬¬A ∧ ¬B)^F
¬¬A^F  |  ¬B^F
¬A^T   |  B^T
A^F    |
⊥      |
Since the left alternative closes against A^T, any branch containing A^T and (A ⇒ B)^T can be extended by B^T.
With these derived rules, theorem proving becomes quite efficient. Using them, the tableau from Example 13.2.8 would have the following, much simpler form:
As always we need to convince ourselves that the calculus is sound, otherwise, tableau proofs
do not guarantee validity, which we are after. Since we are now in a refutation setting we cannot
just show that the inference rules preserve validity: we care about unsatisfiability (which is the
dual notion to validity), as we want to show the initial labeled formula to be unsatisfiable. Before
we can do this, we have to ask ourselves, what it means to be (un)-satisfiable for a labeled formula
or a tableau.
Soundness (Tableau)
Idea: A test calculus is refutation sound, iff its inference rules preserve satisfiability
Thus we only have to prove Lemma 13.4.3; this is relatively easy to do. For instance for the first rule: if we have a tableau that contains (A ∧ B)^T and is satisfiable, then it must have a satisfiable branch. If (A ∧ B)^T is not on this branch, the tableau extension will not change satisfiability, so we can assume that it is on the satisfiable branch and thus I_φ(A ∧ B) = T for some variable assignment φ. Thus I_φ(A) = T and I_φ(B) = T, so after the extension (which adds the formulae A^T and B^T to the branch), the branch is still satisfiable. The cases for the other rules are similar.
The next result is a very important one, it shows that there is a procedure (the tableau procedure) that will always terminate and answer the question whether a given propositional formula is valid or not. This is very important, since other logics (like the often-studied first-order logic) do not enjoy this property.
Note: The proof above only works for the “base T0” because (only) there the rules do not “copy”. A rule like
(A ⇔ B)^T: split into A^T, B^T and A^F, B^F
does, and in particular the number of formulae that still have to be worked off below the line is larger than above the line. For such rules we would need a more intricate version of µ which – instead of returning a natural number – returns a more complex object; a multiset of numbers would work here. In our proof we are just assuming that the defined connectives have already been eliminated.
The tableau calculus basically computes the disjunctive normal form: every branch is a disjunct that is a conjunction of literals. The method relies on the fact that a DNF is unsatisfiable, iff each of its disjuncts is, i.e. iff each branch contains a contradiction in form of a pair of opposite literals.
We write 2 for the “empty” disjunction (no disjuncts) and call it the empty clause. A clause with exactly one literal is called a unit clause.
Definition 13.5.2 (Resolution Calculus). The resolution calculus R0 operates on clause sets via a single inference rule:
R: from P^T ∨ A and P^F ∨ B derive A ∨ B
This rule allows us to add the resolvent (the clause below the line) to a clause set which contains the two clauses above it. The literals P^T and P^F are called cut literals.
Definition 13.5.3 (Resolution Refutation). Let S be a clause set, then we call an R0-derivation of 2 from S an R0-refutation and write D : S⊢R0 2.
Definition 13.5.4.
We will often write a clause set {C 1 , . . ., C n } as C 1 ; . . . ; C n , use S ; T for the
union of the clause sets S and T , and S ; C for the extension by a clause C.
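As an illustration of R0 (my own; the encoding of clauses as frozensets of signed literals is an implementation choice, not notation from the notes), here is the resolution rule plus a naive saturation loop that searches for the empty clause 2:

from itertools import combinations

# A clause is a frozenset of literals; a literal is a pair (proposition, "T" or "F").
def resolvents(c1, c2):
    """All clauses obtainable from c1, c2 by one application of the resolution rule."""
    out = set()
    for (p, a) in c1:
        b = "F" if a == "T" else "T"
        if (p, b) in c2:                           # (p, a) and (p, b) are the cut literals
            out.add((c1 - {(p, a)}) | (c2 - {(p, b)}))
    return out

def refutable(clauses):
    """Saturate under resolution; True iff the empty clause is derivable."""
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if not r:
                    return True                    # derived the empty clause 2
                if r not in clauses:
                    new.add(r)
        if not new:
            return False                           # saturated without deriving 2
        clauses |= new

# The clause set from Example 13.5.10 below:
S = [frozenset({("P", "F"), ("Q", "F"), ("R", "T")}),
     frozenset({("P", "F"), ("Q", "T")}),
     frozenset({("P", "T")}),
     frozenset({("R", "F")})]
print(refutable(S))   # True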
Definition 13.5.5 (Transformation into Clause Normal Form). The CNF trans-
formation calculus CNF0 consists of the following four inference rules on sets of
labeled formulae.
from C ∨ (A ∨ B)^T derive C ∨ A^T ∨ B^T        from C ∨ (A ∨ B)^F derive C ∨ A^F and C ∨ B^F
from C ∨ (¬A)^T derive C ∨ A^F                 from C ∨ (¬A)^F derive C ∨ A^T
Definition 13.5.6. We write CNF0 (Aα ) for the set of all clauses derivable from
Aα via the rules above.
Note that the C-terms in the definition of the inference rules are necessary, since we assumed that the assumptions of the inference rules must match full clauses. The C-terms are used with the convention that they are optional, so that we can also simplify (A ∨ B)^T to A^T ∨ B^T.
Background: The background behind this notation is that A and T ∨ A are equivalent for any
A. That allows us to interpret the C-terms in the assumptions as T and thus leave them out.
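For comparison, here is a small Python sketch of a CNF computation (my own formulation, not literally the labeled-formula calculus CNF0: it first pushes negations inward and then distributes ∨ over ∧, which yields the same clause set for the ∧/∨/¬ fragment):

# Formulae: ("var", P), ("not", A), ("and", A, B), ("or", A, B).
def nnf(f, positive=True):
    """Push negations down to the atoms (negation normal form)."""
    op = f[0]
    if op == "var":
        return f if positive else ("not", f)
    if op == "not":
        return nnf(f[1], not positive)
    a, b = nnf(f[1], positive), nnf(f[2], positive)
    # under a negation, De Morgan swaps the connective
    return (op if positive else ("or" if op == "and" else "and"), a, b)

def clauses(f):
    """CNF of an NNF formula as a set of clauses (frozensets of signed literals)."""
    op = f[0]
    if op == "var":
        return {frozenset({(f[1], "T")})}
    if op == "not":                               # in NNF, the argument is atomic
        return {frozenset({(f[1][1], "F")})}
    if op == "and":
        return clauses(f[1]) | clauses(f[2])
    if op == "or":                                # distribute: union every pair of clauses
        return {c | d for c in clauses(f[1]) for d in clauses(f[2])}

P, Q = ("var", "P"), ("var", "Q")
print(clauses(nnf(("not", ("or", P, ("not", Q))))))   # two unit clauses: P^F and Q^T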
The clause normal form translation as we have formulated it here is quite frugal; we have left out rules for the connectives ∧, ⇒, and ⇔, relying on the fact that formulae containing these connectives can be translated into ones without them before CNF transformation. The advantage of
having a calculus with few inference rules is that we can prove meta properties like soundness and
completeness with less effort (these proofs usually require one case per inference rule). On the
other hand, adding specialized inference rules makes proofs shorter and more readable.
Fortunately, there is a way to have your cake and eat it. Derivable inference rules have the property
that they are formally redundant, since they do not change the expressive power of the calculus.
Therefore we can leave them out when proving meta-properties, but include them when actually
using the calculus.
Example 13.5.8. The derivation
C ∨ (A ⇒ B)^T  ⟶  C ∨ (¬A ∨ B)^T  ⟶  C ∨ ¬A^T ∨ B^T  ⟶  C ∨ A^F ∨ B^T
justifies the derived rule
from C ∨ (A ⇒ B)^T derive C ∨ A^F ∨ B^T directly.
With these derivable rules, theorem proving becomes quite efficient. To get a better under-
standing of the calculus, we look at an example: we prove an axiom of the Hilbert Calculus we
have studied above.
Result: {P^F ∨ Q^F ∨ R^T, P^F ∨ Q^T, P^T, R^F}
Example 13.5.10. Resolution Proof
1  P^F ∨ Q^F ∨ R^T    initial
2  P^F ∨ Q^T          initial
3  P^T                initial
4  R^F                initial
5  P^F ∨ Q^F          resolve 1.3 with 4.1
6  Q^F                resolve 5.1 with 3.1
7  P^F                resolve 2.2 with 6.1
8  2                  resolve 7.1 with 3.1
Let ∆ be a clause set that contains a unit clause {l}, and let ∆′ result from ∆ by removing all clauses that contain l and deleting the opposite literal of l from the remaining clauses. Then ∆ is satisfiable, iff ∆′ is. We call ∆′ the clause set simplification of ∆ wrt. l.
Corollary 13.5.11. Adding clause set simplification wrt. unit clauses to R0 does not affect soundness and completeness.
This is almost always a good idea! (clause set simplification is cheap)
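A minimal helper (mine) implementing clause set simplification as stated above, on the frozenset clause representation from the resolution sketch:

def simplify(clauses, lit):
    """Clause set simplification wrt. the literal lit = (proposition, truth value)."""
    p, a = lit
    opposite = (p, "F" if a == "T" else "T")
    return {c - {opposite}                          # delete the opposite literal ...
            for c in clauses if lit not in c}       # ... and drop clauses satisfied by lit

D = {frozenset({("P", "T"), ("Q", "F")}), frozenset({("P", "F")}), frozenset({("Q", "T")})}
print(simplify(D, ("Q", "T")))   # leaves the clauses {P^T} and {P^F}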
Before we come to the general mechanism, we will go into how we would “convince ourselves that the Wumpus is in [1, 3]”.
Idea: We formalize the knowledge about the Wumpus world in PL0 and use a test
calculus to check for entailment.
Simplification: We worry only about the Wumpus and stench: S_{i,j} =̂ stench in [i, j], W_{i,j} =̂ Wumpus in [i, j].
Propositions whose value we know: ¬S_{1,1}, ¬W_{1,1}, ¬S_{2,1}, ¬W_{2,1}, S_{1,2}, ¬W_{1,2}.
The first step is to compute the clause normal form of the relevant knowledge. Given this clause normal form, we only need to generate the empty clause via repeated applications of the resolution rule.
Now that we have seen how we can use propositional inference to derive consequences of the
percepts and world knowledge, let us come back to the question of a general mechanism for agent
functions with propositional inference.
Admittedly, the search framework from chapter 8 does not quite cover the agent function we
have here, since that assumes that the world is fully observable, which the Wumpus world is
emphatically not. But it already gives us a good impression of what would be needed for the
“general mechanism”.
Summary
Every propositional formula can be brought into conjunctive normal form (CNF),
which can be identified with a set of clauses.
The tableau and resolution calculi are deduction procedures based on trying to derive
a contradiction from the negated theorem (a closed tableau or the empty clause).
They are refutation complete, and can be used to prove KB |= A by showing that
KB ∪ {¬A} is unsatisfiable.
Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this in
AI-2, but provide one for the calculi introduced so far in??.
Chapter 14
Formal Systems
We will now take a more abstract view and introduce the necessary prerequisites of abstract rule
systems. We will also take the opportunity to discuss the quality criteria for calculi.
The notion of a logical system is at the basis of the field of logic. In its most abstract form, a logical
system consists of a formal language, a class of models, and a satisfaction relation between models
and expressions of the formal language. The satisfaction relation tells us when an expression is
deemed true in this model.
Logical Systems
Definition 14.0.1. A logical system (or simply a logic) is a triple L:=⟨L, K, |=⟩,
where L is a formal language, K is a set and |= ⊆ K × L. Members of L are called
formulae of L, members of K models for L, and |= the satisfaction relation.
Example 14.0.2 (Propositional Logic). ⟨wff(Σ_PL0, V_PL0), K, |=⟩ is a logical system, if we define K := V_PL0 ⇀ D_0 (the set of variable assignments) and φ |= A iff I_φ(A) = T.
Definition 14.0.3. Let ⟨L, K, |=⟩ be a logical system, M∈K be a model and A∈L a formula, then we say that A is satisfied by M, iff M |= A, falsified by M, iff M ̸|= A, satisfiable in K, iff M |= A for some M∈K, valid in K (write |= A), iff M |= A for all M∈K, falsifiable in K, iff M ̸|= A for some M∈K, and unsatisfiable in K, iff M ̸|= A for all M∈K.
Let us now turn to the syntactical counterpart of the entailment relation: derivability in a
calculus. Again, we take care to define the concepts at the general level of logical systems.
The intuition of a calculus is that it provides a set of syntactic rules that allow us to reason by considering the form of propositions alone. Such rules are called inference rules, and they can be strung together into derivations – which can alternatively be viewed either as sequences of formulae where all formulae are justified by prior formulae or as trees of inference rule applications. But we
can also define a calculus in the more general setting of logical systems as an arbitrary relation on
formulae with some general properties. That allows us to abstract away from the homomorphic
setup of logics and calculi and concentrate on the basics.
Definition 14.0.6. Let L be the formal language of a logical system, then an inference rule over L is a decidable (n+1)-ary relation on L. Inference rules are traditionally written as
A_1 ... A_n
――――――――――― N
     C
where A_1, ..., A_n and C are formula schemata for L and N is a name.
The A_i are called assumptions of N, and C is called its conclusion.
By formula schemata we mean representations of sets of formulae; we use boldface uppercase letters as (meta-)variables for formulae. For instance, the formula schema A ⇒ B represents the set of formulae whose head is ⇒.
Derivations
Definition 14.0.9. Let L := ⟨L, K, |=⟩ be a logical system and C a calculus for L, then a C-derivation of a formula C∈L from a set H ⊆ L of hypotheses (write H⊢C C) is a sequence A_1, ..., A_m of L-formulae, such that
A_m = C, (derivation culminates in C)
for all 1≤i≤m, either A_i∈H, (hypothesis)
or there is an inference rule
A_{l_1} ... A_{l_k}
―――――――――――――――――――
        A_i
in C with l_j < i for all j≤k. (rule application)
We can also see a derivation as a derivation tree, where the A_{l_j} are the children of the node A_i.
Example 14.0.10.
Inference rules are relations on formulae represented by formula schemata (where boldface, uppercase letters are used as meta-variables for formulae). For instance, in Example 14.0.10 the inference rule
A ⇒ B    A
――――――――――
     B
was applied in a situation where the meta-variables A and B were instantiated by the formulae P and Q ⇒ P.
As axioms do not have assumptions, they can be added to a derivation at any time. This is just
what we did with the axioms in Example 14.0.10.
Formal Systems
Let ⟨L, K, |=⟩ be a logical system and C a calculus, then ⊢C is a derivation relation
and thus ⟨L, K, |=, ⊢C ⟩ a derivation system.
Therefore we will sometimes also call ⟨L, K, |=, C ⟩ a formal system, iff L:=⟨L, K, |=⟩
is a logical system, and C a calculus for L.
Definition 14.0.11.
Let C be a calculus, then a C-derivation ∅⊢C A is called a proof of A and if one
exists (write ⊢C A) then A is called a C-theorem.
Definition 14.0.12.
An inference rule I is called admissible in a calculus C, if the extension of C by I
does not yield new theorems.
Definition 14.0.13. An inference rule
A_1 ... A_n
―――――――――――
     C
is called derivable (or a derived rule) in a calculus C, if there is a C-derivation A_1, ..., A_n ⊢C C.
Observation 14.0.14. Derivable inference rules are admissible, but not the other
way around.
The notion of a formal system encapsulates the most general way we can conceptualize a system
with a calculus, i.e. a system in which we can do “formal reasoning”.
Chapter 15
Propositional Reasoning: SAT Solvers
15.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25019.
Definition 15.1.2. Tools addressing SAT are commonly referred to as SAT solvers.
Upshot: Anything we can do with CSP, we can (in principle) do with SAT.
[Circuit figure: a register of 2 bits x1 and x0 encoding the value c = 2·x1 + x0 (FF =̂ flip-flop, D =̂ data in, CLK =̂ clock).]
To Verify: If c < 3 in the current clock cycle, then c < 3 in the next clock cycle.
The answer is “no”. And in some cases we can figure out exactly when they
are/aren’t hard to solve.
function DPLL(∆, I)  /∗ I: a partial interpretation, initially empty ∗/
∆′ := a copy of ∆; I ′ := I
/∗ Unit Propagation: ∗/
while ∆′ contains a unit clause {l} do
extend I ′ with the respective truth value for the proposition underlying l
simplify ∆′ /∗ remove false literals ∗/
/∗ Termination Test: ∗/
if 2∈∆′ then return ‘‘unsatisfiable’’
if ∆′ = {} then return I ′
/∗ Splitting Rule: ∗/
select some proposition P for which I ′ is not defined
I ′′ := I ′ extended with one truth value for P ; ∆′′ := a copy of ∆′ ; simplify ∆′′
if I ′′′ := DPLL(∆′′ ,I ′′ ) ̸= ‘‘unsatisfiable’’ then return I ′′′
I ′′ := I ′ extended with the other truth value for P ; ∆′′ := ∆′ ; simplify ∆′′
return DPLL(∆′′ ,I ′′ )
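For concreteness, here is a compact Python rendering of the procedure above (a sketch of mine; it reuses the simplify helper from the clause set simplification sketch in the previous chapter and omits everything a real SAT solver adds, such as clause learning, branching heuristics, and watched literals):

def dpll(clauses, I=None):
    """Return a satisfying (partial) interpretation as a dict, or None for 'unsatisfiable'."""
    I = dict(I or {})
    clauses = set(clauses)
    # Unit Propagation
    while True:
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        (p, a), = unit
        I[p] = (a == "T")
        clauses = simplify(clauses, (p, a))
    # Termination Test
    if frozenset() in clauses:
        return None                               # empty clause derived: unsatisfiable
    if not clauses:
        return I
    # Splitting Rule
    p = next(iter(next(iter(clauses))))[0]        # some proposition occurring in clauses
    for a in ("T", "F"):
        result = dpll(simplify(clauses, (p, a)), {**I, p: a == "T"})
        if result is not None:
            return result
    return None

# Example 15.2.2:
D = [frozenset({("P", "T"), ("Q", "T"), ("R", "F")}), frozenset({("P", "F"), ("Q", "F")}),
     frozenset({("R", "T")}), frozenset({("P", "T"), ("Q", "F")})]
print(dpll(D))   # e.g. {'R': True, 'P': True, 'Q': False}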
Example 15.2.2 (UP and Splitting). Let ∆ := (P^T ∨ Q^T ∨ R^F ; P^F ∨ Q^F ; R^T ; P^T ∨ Q^F)
1. UP Rule: R ↦ T, leaving P^T ∨ Q^T ; P^F ∨ Q^F ; P^T ∨ Q^F
2. Splitting Rule:
2a. P ↦ F, leaving Q^T ; Q^F                   2b. P ↦ T, leaving Q^F
3a. UP Rule: Q ↦ T, leaving 2;                 3b. UP Rule: Q ↦ F, clause set empty;
    returning “unsatisfiable”                      returning “R ↦ T, P ↦ T, Q ↦ F”
2. UP Rule: Q ↦ T, leaving P^F ; P^T ∨ R^F ; R^T
3. UP Rule: R ↦ T, leaving P^F ; P^T
4. UP Rule: P ↦ T, leaving 2
[Figure: a DPLL search tree branching on P, then X_1, ..., X_n, then Q on every branch.]
Properties of DPLL
Unsatisfiable case: What can we say if “unsatisfiable” is returned?
In this case, we know that ∆ is unsatisfiable: Unit propagation is sound, in the
sense that it does not reduce the set of solutions.
Satisfiable case: What can we say when a partial interpretation I is returned?
Any extension of I to a complete interpretation satisfies ∆. (By construction,
I suffices to satisfy all clauses.)
UP =̂ Unit Resolution
Observation: The unit propagation (UP) rule corresponds to a calculus:
while ∆′ contains a unit clause {l} do
extend I ′ with the respective truth value for the proposition underlying l
simplify ∆′ /∗ remove false literals ∗/
Definition 15.3.1 (Unit Resolution). Unit resolution (UR) is the test calculus consisting of the following inference rule:
UR: from C ∨ P^α and P^β with α ≠ β derive C
Unit propagation =̂ resolution restricted to cases where one parent is a unit clause.
UR makes only limited inferences, as long as there are unit clauses. It does not
guarantee to infer everything that can be inferred.
Example 15.3.7. We follow the steps in the proof of Theorem 15.3.6 for ∆ := (Q^F ∨ P^F ; P^T ∨ Q^F ∨ R^F ∨ S^F ; Q^T ∨ S^F ; R^T ∨ S^F ; S^T).
[Figure: on the left, the DPLL tree (without UP, each leaf annotated with the clause that became empty there), splitting on S, Q, R, and P; on the right, the tree resolution proof of 2 read off from that DPLL tree.]
7. Due to (1), we have (b) for N k . But we do not necessarily have (a): C(N ) ⊆ {L1 , . . . , Lk },
but there are cases where Lk ̸∈C(N ) (e.g., if X k is not contained in any clause and thus
branching over it was completely unnecessary). If so, however, we can simply remove N k and
all its descendants from the tree as well. We attach C(N) at the L_(k−1) branch of N_(k−1), in the role of C(N_(k−1), L_(k−1)). If L_(k−1)∈C(N) then we have (a) for N′ := N_(k−1) and can stop. If L_(k−1)∉C(N), then we remove N_(k−1) and so forth, until either we stop with (a), or have removed N_1 and thus must already have derived the empty clause (because C(N) ⊆ {L_1, ..., L_k}\{L_1, ..., L_k}).
8. Unit propagation can be simulated via applications of the splitting rule, choosing a proposi-
tion that is constrained by a unit clause: One of the two truth values then immediately yields
an empty clause.
In fact: DPLL =̂ tree resolution.
Definition 15.3.9. In a tree resolution, each derived clause C is used only once
(at its parent).
Problem: The same C must be derived anew every time it is used!
This is a fundamental weakness: There are inputs ∆ whose shortest tree reso-
lution proof is exponentially longer than their shortest (general) resolution proof.
Intuitively: DPLL makes the same mistakes over and over again.
Idea: DPLL should learn from its mistakes on one search branch, and apply the
learned knowledge to other branches.
To the rescue: clause learning (up next)
[Figure: the DPLL search tree on ∆ ∪ Θ: after splitting on P, X_1, ..., X_n, and Q, every one of the exponentially many branches runs into the same conflict (R^T; 2).]
Proof sketch: UP can’t derive l′ whose value was already set beforehand.
Intuition: The initial vertices are the choice literals and unit clauses of ∆.
2a. Splitting Rule: P ↦ F, i.e. choice literal P^F; the remaining clauses are Q^T ; Q^F.
3a. UP Rule: Q ↦ T, i.e. implied literal Q^T, with edges (R^T, Q^T) and (P^F, Q^T).
The clause P^T ∨ Q^F becomes empty: conflict vertex 2_{P^T ∨ Q^F}, with edges (P^F, 2_{P^T ∨ Q^F}) and (Q^T, 2_{P^T ∨ Q^F}).
P F
T
X1
T F
Xn Xn
T F T F
Q Q Q Q
T F T F T F T F
T T T T T T T T
R ; 2R ; 2R ; 2R ; 2R ; 2R ; 2R ; 2R ; 2
∆ := P^F ∨ Q^F ∨ R^T ; P^F ∨ Q^F ∨ R^F ; P^F ∨ Q^T ∨ R^T ; P^F ∨ Q^T ∨ R^F
Θ := X_1^T ∨ ... ∨ X_n^T ; X_1^F ∨ ... ∨ X_n^F
[Figure: the implication graph for DPLL on ∆ ∪ Θ with choice literals P^T, X_1^T, ..., X_n^T, Q^T, the implied literal R^T, and a conflict vertex.]
It depends on “ordering decisions” during UP: Which unit clause is picked first.
Example 15.4.8. ∆ = P^F ∨ Q^F ; Q^T ; P^T
[Figure: the two implication graphs resulting from the two possible UP orderings: Option 1 (propagate Q^T first) contains Q^T and P^F, Option 2 (propagate P^T first) contains P^T and Q^F; each ends in a conflict vertex for P^F ∨ Q^F.]
Conflict Graphs
A conflict graph captures “what went wrong” in a failed node.
Definition 15.4.9 (Conflict Graph). Let ∆ be a clause set, and let G^impl_β be the implication graph for some search branch β of DPLL on ∆. A subgraph C of G^impl_β is a conflict graph if:
(i) C contains exactly one conflict vertex 2_C.
(ii) If l′ is a vertex in C, then all parents of l′, i.e. vertices l_i with an edge (l_i, l′), are vertices in C as well.
(iii) All vertices in C have a path to 2_C.
Conflict graph =̂ starting at a conflict vertex, backchain through the implication graph until reaching choice literals.
Clause Learning
Observation: Conflict graphs encode the entailment relation.
Definition 15.5.1. Let ∆ be a clause set, C be a conflict graph at some time point during a run of DPLL on ∆, and L be the choice literals in C, then we call c := ⋁_{l∈L} l̄ (the disjunction of the negations of the choice literals) the learned clause for C.
Theorem 15.5.2. Let ∆, C, and c be as in Definition 15.5.1, then ∆ |= c.
Idea: We can add learned clauses to DPLL derivations at any time without losing
soundness. (maybe this helps, if we have a good notion of learned clauses)
[Figure: the conflict graph for the situation above: choice literals P^T and Q^T, implied literal R^T, and the conflict vertex.]
Learned clause: P^F ∨ Q^F
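The following sketch (mine) shows how a learned clause is read off a conflict graph: represent the implication graph by a map from vertices to their parents, backchain from the conflict vertex, and return the negations of the choice literals reached. The concrete graph below encodes the running example (choices P^T, X_i^T, Q^T; implied R^T; conflict on the clause of ∆ that becomes empty), and the function indeed returns P^F ∨ Q^F.

def learned_clause(parents, conflict, choices):
    """Backchain from the conflict vertex to the choice literals and negate them."""
    seen, stack, reached = set(), [conflict], set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if v in choices:
            reached.add(v)                       # choice literal: contributes to the learned clause
        else:
            stack.extend(parents.get(v, []))     # implied literal / conflict vertex: follow the edges
    return frozenset((p, "F" if a == "T" else "T") for (p, a) in reached)

# Running example: choices P^T, X1^T, ..., Xn^T, Q^T; R^T is implied by P^T and Q^T;
# the clause P^F v Q^F v R^F of Delta becomes empty, giving the conflict vertex.
parents = {("R", "T"): [("P", "T"), ("Q", "T")],
           "conflict": [("P", "T"), ("Q", "T"), ("R", "T")]}
choices = {("P", "T"), ("Q", "T")} | {(f"X{i}", "T") for i in range(1, 4)}
print(learned_clause(parents, "conflict", choices))   # frozenset({('P', 'F'), ('Q', 'F')})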
Example 15.5.5. l_1 = P, C = P^F ∨ Q^F, l′ = Q.
Observation:
Given the earlier choices l_1, ..., l_k, after we have learned the new clause C = l̄_1 ∨ ... ∨ l̄_k ∨ l̄′, the value of l′ is now set by UP!
So we can continue:
3. We set the opposite l̄′ of the last choice as an implied literal, e.g. Q^F as an implied literal.
4. We run UP and analyze conflicts.
Learned clause: earlier choices only! e.g. C = P^F, see next slide.
∆ := P^F ∨ Q^F ∨ R^T ; P^F ∨ Q^F ∨ R^F ; P^F ∨ Q^T ∨ R^T ; P^F ∨ Q^T ∨ R^F
Θ := X_1^T ∨ ... ∨ X_100^T ; X_1^F ∨ ... ∨ X_100^F
[Figure: the implication graph after learning P^F ∨ Q^F: choice literal P^T; Q^F now an implied literal (set by UP from the learned clause); further choices X_1^T, ..., X_n^T; implied literal R^T and a conflict vertex 2.]
Learned clause: P^F
[Figure: the DPLL search tree with clause learning: the branch P^T, X_1^T, ..., X_n^T, Q^T runs into the conflict (R^T; 2) and learns the clause P^F ∨ Q^F; Q^F is then set by UP, and the next conflict (R^T; 2) learns P^F.]
Note: Here, the problem could be avoided by splitting over different variables.
Problem: This is not so in general! (see next slide)
Recall: DPLL =̂ tree resolution (from slide 388)
1. in particular: each derived clause C (not in ∆) is derived anew every time it is
used.
2. Problem: there are ∆ whose shortest tree resolution proof is exponentially longer
than their shortest (general) resolution proof.
Remarks
Which clause(s) to learn?:
Imagine I gave you as homework to make a formula family {φ} where the DPLL running time is necessarily on the order of 2^n.
I promise you’re not gonna find this easy . . . (although it is of course possible:
e.g., the “Pigeon Hole Problem”).
People noticed by the early 90s that, in practice, the DPLL worst case does not
tend to happen.
Modern SAT solvers successfully tackle practical instances where n > 1.000.000.
Difficulty 1: What is the “typical case” in applications? E.g., what is the “average”
Hardware Verification instance?
Consider precisely defined random distributions instead.
Difficulty 2: Search trees get very complex, and are difficult to analyze math-
ematically, even in trivial examples. Never mind examples of practical relevance
...
The most successful works are empirical. (Interesting theory is mainly concerned
with hand-crafted formulas, like the Pigeon Hole Problem.)
15.7 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25090.
Summary
SAT solvers decide satisfiability of CNF formulas. This can be used for deduction,
and is highly successful as a general problem solving technique (e.g., in Verification).
DPLL =̂ backtracking with inference performed by unit propagation (UP), which
iteratively instantiates unit clauses and simplifies the formula.
end for
end for
return ‘‘no satisfying assignment found’’
local search is not as successful in SAT applications, and the underlying ideas are
very similar to those presented in section 8.6 (Not covered here)
Resolution special cases: There’s a universe in between unit resolution and full
resolution: trade off inference vs. search.
Proof complexity: Can one resolution special case X simulate another one Y
polynomially? Or is there an exponential separation (example families where X is
exponentially less effective than Y )?
Suggested Reading:
• Chapter 7: Logical Agents, Section 7.6.1 [RN09].
– Here, RN describe DPLL, i.e., basically what I cover under “The Davis-Putnam (Logemann-
Loveland) Procedure”.
– That’s the only thing they cover of this Chapter’s material. (And they even mark it as “can
be skimmed on first reading”.)
– This does not do the state of the art in SAT any justice.
• Chapter 7: Logical Agents, Sections 7.6.2, 7.6.3, and 7.7 [RN09].
– Sections 7.6.2 and 7.6.3 say a few words on local search for SAT, which I recommend as
additional background reading. Section 7.7 describes in quite some detail how to build an
agent using propositional logic to take decisions; nice background reading as well.
Chapter 16
First-Order Predicate Logic
Let’s Talk About Blocks, Baby . . .
Question: What do you see here?
[Figure: five blocks labeled A, D, B, E, C on a table]
You say: “All blocks are red”; “All blocks are on the table”; “A is a block”.
And now: Say it in propositional logic!
Answer: “isRedA”,“isRedB”, . . . , “onTableA”, “onTableB”, . . . , “isBlockA”, . . .
Wait a sec!: Why don’t we just say, e.g., “AllBlocksAreRed” and “isBlockA”?
Problem: Could we conclude that A is red? (No)
These statements are atomic (just strings); their inner structure (“all blocks”, “is a
block”) is not captured.
Idea: Predicate Logic (PL1 ) extends propositional logic with the ability to explicitly
speak about objects and their properties.
How?: Variables ranging over objects, predicates describing object properties, . . .
Example 16.1.1. “∀x block(x) ⇒ red(x)”; “block(A)”
Note: Even when we can describe the problem suitably, for the desired reasoning,
the propositional formulation typically is way too large to write (by hand).
PL1 solution: “∀x Wumpus(x) ⇒ (∀y adj(x, y) ⇒ stench(y))”
Example 16.1.3.
There is a surjective function from the natural numbers into the reals.
First-Order Predicate Logic has many good properties (complete calculi,
compactness, unitary, linear unification,. . . )
But too weak for formalizing: (at least directly)
We make the deliberate, but non-standard design choice here to include Skolem constants into
the signature from the start. These are used in inference systems to give names to objects and
construct witnesses. Other than the fact that they are usually introduced by need, they work
exactly like regular constants, which makes the inclusion rather painless. As we can never predict
how many Skolem constants we are going to need, we give ourselves countably infinitely many for
every arity. Our supply of individual variables is countably infinite for the same reason. The formulae of first-order logic are built up from the signature and variables as terms (to represent individuals) and propositions (to represent propositions). The latter include the propositional connectives, but also quantifiers.
if p∈Σ^p_k and A_i∈wff_ι(Σ_ι, V_ι) for i≤k, then p(A_1, ..., A_k)∈wff_o(Σ_ι, V_ι),
if A, B∈wff_o(Σ_ι, V_ι) and X∈V_ι, then T, A ∧ B, ¬A, ∀X A∈wff_o(Σ_ι, V_ι). ∀ is a binding operator called the universal quantifier.
Note that we only need e.g. conjunction, negation, and universal quantification; all other logical constants can be defined from them (as we will see when we have fixed their interpretations).
Here        Elsewhere
∀x A        ⋀x A,  (x)A
∃x A        ⋁x A
The introduction of quantifiers to first-order logic brings a new phenomenon: variables that are under the scope of a quantifier behave very differently from the ones that are not. Therefore we build up a vocabulary that distinguishes the two.
free(X) := {X}
free(f(A_1, ..., A_n)) := ⋃_{1≤i≤n} free(A_i)
free(p(A_1, ..., A_n)) := ⋃_{1≤i≤n} free(A_i)
free(¬A) := free(A)
free(A ∧ B) := free(A) ∪ free(B)
free(∀X A) := free(A)\{X}
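This definition translates directly into code. The following Python sketch (mine; terms and formulae as nested tuples – ("var", X), ("fun"/"pred", name, [args]), ("not", A), ("and", A, B), ("forall", X, A)) will be reused in later sketches:

def free(f):
    """Free variables of a term or formula, following the clauses above."""
    op = f[0]
    if op == "var":
        return {f[1]}
    if op in ("fun", "pred"):                     # f = (op, name, [A1, ..., An])
        return set().union(set(), *(free(a) for a in f[2]))
    if op == "not":
        return free(f[1])
    if op == "and":
        return free(f[1]) | free(f[2])
    if op == "forall":                            # f = ("forall", X, A)
        return free(f[2]) - {f[1]}
    raise ValueError(op)

A = ("forall", "X", ("pred", "o", [("fun", "f", [("var", "X")]), ("var", "Y")]))
print(free(A))   # {'Y'}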
We will be mainly interested in (sets of) sentences – i.e. closed propositions – as the representations of meaningful statements about individuals. Indeed, we will see below that free variables do not give us additional expressivity, since they behave like constants and could be replaced by them in all situations, except the recursive definition of quantified formulae. Indeed in all situations where
variables occur freely, they have the character of meta-variables, i.e. syntactic placeholders that
can be instantiated with terms when needed in an inference calculus.
The semantics of first-order logic is a Tarski-style set-theoretic semantics where the atomic syn-
tactic entities are interpreted by mapping them into a well-understood structure, a first-order
universe that is just an arbitrary set.
Definition 16.2.13. We inherit the universe Do = {T, F} of truth values from PL0
and assume an arbitrary universe Dι ̸= ∅ of individuals (this choice is a parameter
to the semantics)
Definition 16.2.14. An interpretation I assigns values to constants, e.g.
We do not have to make the universe of truth values part of the model, since it is always the same;
we determine the model by choosing a universe and an interpretation function.
Given a first-order model, we can define the evaluation function as a homomorphism over the
construction of formulae.
The only new (and interesting) case in this definition is the quantifier case, there we define the
value of a quantified formula by the value of its scope – but with an extension of the incoming
variable assignment. Note that by passing to the scope A of ∀x A, the occurrences of the variable
x in A that were bound in ∀x A become free and are amenable to evaluation by the variable
assignment ψ:=φ,[a/X]. Note that as an extension of φ, the assignment ψ supplies exactly the
right value for x in A. This variability of the variable assignment in the definition of the value
16.2. FIRST-ORDER LOGIC 267
function justifies the somewhat complex setup of first-order evaluation, where we have the (static)
interpretation function for the symbols from the signature and the (dynamic) variable assignment
for the variables.
Note furthermore, that the value I φ (∃x A) of ∃x A, which we have defined to be ¬(∀x ¬A) is
true, iff it is not the case that I φ (∀x ¬A) = I ψ (¬A) = F for all a∈Dι and ψ:=φ,[a/X]. This is
the case, iff I ψ (A) = T for some a∈Dι . So our definition of the existential quantifier yields the
appropriate semantics.
Signature: Let Σ^f_0 := {j, m}, Σ^f_1 := {f}, and Σ^p_2 := {o}.
Universe: D_ι := {J, M}.
Interpretation: I(j) := J, I(m) := M, I(f)(J) := M, I(f)(M) := M, and I(o) := {(M, J)}.
Then ∀X o(f(X), X) is a sentence and with ψ := φ,[a/X] for a∈D_ι we have
I_φ(∀X o(f(X), X)) = T   iff   I_ψ(o(f(X), X)) = T for all a∈D_ι
                         iff   (I_ψ(f(X)), I_ψ(X))∈I(o) for all a∈{J, M}
                         iff   (I(f)(I_ψ(X)), ψ(X))∈{(M, J)} for all a∈{J, M}
                         iff   (I(f)(ψ(X)), a) = (M, J) for all a∈{J, M}
                         iff   I(f)(a) = M and a = J for all a∈{J, M}
But a ≠ J for a = M, so I_φ(∀X o(f(X), X)) = F in the model ⟨D_ι, I⟩.
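To see the evaluation at work, here is a Python sketch (mine) of this model; holds ranges the variable assignment over the universe exactly as in the calculation above and indeed returns False for ∀X o(f(X), X):

universe = {"J", "M"}
I_fun  = {"j": lambda: "J", "m": lambda: "M", "f": lambda x: "M"}   # I(f)(J) = I(f)(M) = M
I_pred = {"o": {("M", "J")}}                                        # I(o) = {(M, J)}

def evaluate(t, phi):
    """Value of a term under the interpretation and the assignment phi."""
    if t[0] == "var":
        return phi[t[1]]
    return I_fun[t[1]](*(evaluate(a, phi) for a in t[2]))

def holds(f, phi):
    """Truth value of a formula in the model, following the evaluation clauses."""
    op = f[0]
    if op == "pred":
        return tuple(evaluate(a, phi) for a in f[2]) in I_pred[f[1]]
    if op == "not":
        return not holds(f[1], phi)
    if op == "and":
        return holds(f[1], phi) and holds(f[2], phi)
    if op == "forall":   # extend phi with every value a of the universe for the bound variable
        return all(holds(f[2], {**phi, f[1]: a}) for a in universe)
    raise ValueError(op)

A = ("forall", "X", ("pred", "o", [("fun", "f", [("var", "X")]), ("var", "X")]))
print(holds(A, {}))   # False, as computed above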
Substitutions on Terms
Intuition: If B is a term and X is a variable, then we denote the result of
systematically replacing all occurrences of X in a term A by B with [B/X](A).
Problem: What about [Z/Y ], [Y /X](X), is that Y or Z?
Folklore: [Z/Y ], [Y /X](X) = Y , but [Z/Y ]([Y /X](X)) = Z of course.
(Parallel application)
Definition 16.2.20. Let wfe(Σ, V) be an expression language, then we call σ : V→wfe(Σ, V) a substitution, iff the support supp(σ) := {X | (X, A)∈σ, X ≠ A} of σ is finite. We denote the empty substitution with ϵ.
Example 16.2.22. [a/x], [f (b)/y], [a/z] instantiates g(x, y, h(z)) to g(a, f (b), h(a)).
Definition 16.2.23. Let σ be a substitution, then we call intro(σ) := ⋃_{X∈supp(σ)} free(σ(X)) the set of variables introduced by σ.
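A sketch (mine) of substitution application on terms, with a substitution represented as a finite dictionary (variables outside the dictionary are mapped to themselves, so the support is just the set of keys with a changed value); intro reuses the free function from the sketch above:

def subst(sigma, t):
    """Apply the substitution sigma (a dict: variable -> term) to the term t."""
    if t[0] == "var":
        return sigma.get(t[1], t)                  # unmapped variables stay fixed
    return (t[0], t[1], [subst(sigma, a) for a in t[2]])

def supp(sigma):
    return {x for x, a in sigma.items() if a != ("var", x)}

def intro(sigma):
    return set().union(set(), *(free(sigma[x]) for x in supp(sigma)))

# [a/x], [f(b)/y], [a/z] applied to g(x, y, h(z))  (Example 16.2.22)
a, b = ("fun", "a", []), ("fun", "b", [])
sigma = {"x": a, "y": ("fun", "f", [b]), "z": a}
print(subst(sigma, ("fun", "g", [("var", "x"), ("var", "y"), ("fun", "h", [("var", "z")])])))
# ('fun', 'g', [a, f(b), h(a)]) as nested tuples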
The extension of a substitution is an important operation, which you will run into from time
to time. Given a substitution σ, a variable x, and an expression A, σ,[A/x] extends σ with a
new value for x. The intuition is that the values right of the comma overwrite the pairs in the
substitution on the left, which already has a value for x, even though the representation of σ may
not show it.
Substitution Extension
Definition 16.2.24 (Substitution Extension).
Let σ be a substitution, then we denote the extension of σ with [A/X] by σ,[A/X]
and define it as {(Y ,B)∈σ|Y ̸= X} ∪ {(X,A)}: σ,[A/X] coincides with σ off X,
and gives the result A there.
Note: If σ is a substitution, then σ,[A/X] is also a substitution.
We also need the dual operation: removing a variable from the support:
Note that the use of the comma notation for substitutions defined in ?? is consistent with sub-
stitution extension. We can view a substitution [a/x], [f (b)/y] as the extension of the empty
substitution (the identity function on variables) by [f (b)/y] and then by [a/x]. Note furthermore,
that substitution extension is not commutative in general.
For first-order substitutions we need to extend the substitutions defined on terms to act on propo-
sitions. This is technically more involved, since we have to take care of bound variables.
Substitutions on Propositions
Problem: We want to extend substitutions to propositions, in particular to quantified formulae: What is σ(∀X A)?
Idea: σ(∀X A) = ∀X σ(A)? But if X occurs in a value σ(Y) for a free variable Y of A – e.g. for σ = [f(X)/Y] and A = p(X, Y) – then that occurrence of X would become bound by the quantifier after instantiation, whereas it was free before. Solution: Rename away the bound variable X in ∀X p(X, Y) before applying the substitution.
Definition 16.2.27 (Capture-Avoiding Substitution Application). Let σ be a
substitution, A a formula, and A′ an alphabetical variant of A, such that intro(σ)∩
BVar(A) = ∅. Then we define σ(A):=σ(A′ ).
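A sketch (mine) of capture-avoiding application on formulae, reusing subst, free, and intro from the sketches above: before descending under a quantifier whose bound variable is introduced by σ, the bound variable is renamed to a fresh one, exactly as the definition prescribes. The fresh-name scheme X0, X1, ... is an ad-hoc choice.

import itertools

def fresh(avoid):
    """A variable name not occurring in avoid."""
    return next(f"X{i}" for i in itertools.count() if f"X{i}" not in avoid)

def subst_formula(sigma, f):
    """Capture-avoiding substitution application on formulae."""
    op = f[0]
    if op == "pred":
        return (op, f[1], [subst(sigma, a) for a in f[2]])
    if op == "not":
        return (op, subst_formula(sigma, f[1]))
    if op == "and":
        return (op, subst_formula(sigma, f[1]), subst_formula(sigma, f[2]))
    if op == "forall":
        x, body = f[1], f[2]
        sigma = {y: t for y, t in sigma.items() if y != x}   # sigma does not act on the bound x
        if x in intro(sigma):                                # danger of variable capture:
            z = fresh(intro(sigma) | free(body))             # rename x to a fresh variable z
            body = subst_formula({x: ("var", z)}, body)
            x = z
        return (op, x, subst_formula(sigma, body))
    raise ValueError(op)

# [f(X)/Y] applied to forall X. p(X, Y): the bound X is renamed away first.
print(subst_formula({"Y": ("fun", "f", [("var", "X")])},
                    ("forall", "X", ("pred", "p", [("var", "X"), ("var", "Y")]))))
# ('forall', 'X0', ('pred', 'p', [('var', 'X0'), ('fun', 'f', [('var', 'X')])]))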
We now introduce a central tool for reasoning about the semantics of substitutions: the “sub-
stitution value Lemma”, which relates the process of instantiation to (semantic) evaluation. This
result will be the motor of all soundness proofs on axioms and inference rules acting on variables
via substitutions. In fact, any logic with variables and substitutions will have (to have) some form
of a substitution value Lemma to get the meta-theory going, so it is usually the first target in any
development of such a logic.
We establish the substitution-value Lemma for first-order logic in two steps, first on terms,
where it is very simple, and then on propositions.
by inductive hypothesis
2.2. This completes the inductive case, and we have proven the assertion.
To understand the proof fully, you should think about where the WLOG – it stands for “without loss of generality” – comes from.
∀I: from A derive ∀X A, provided (∗) that A does not depend on any hypothesis in which X is free.
∀E: from ∀X A derive [B/X](A).
∃I: from [B/X](A) derive ∃X A.
∃E¹: from ∃X A and a derivation of C from the local hypothesis [[c/X](A)]¹, where c∈Σ^sk_0 is new, derive C (discharging the hypothesis).
The intuition behind the rule ∀I is that a formula A with a (free) variable X can be generalized
to ∀X A, if X stands for an arbitrary object, i.e. there are no restricting assumptions about X.
The ∀E rule is just a substitution rule that allows to instantiate arbitrary terms B for X in A.
The ∃I rule says if we have a witness B for X in A (i.e. a concrete term B that makes A true),
then we can existentially close A. The ∃E rule corresponds to the common mathematical practice,
where we give objects we know exist a new name c and continue the proof by reasoning about this
concrete object c. Anything we can prove from the assumption [c/X](A) we can prove outright if
∃X A is known.
A proof of ∃X ¬P(X) from the hypothesis ¬(∀X P(X)) in ND1: assume ¬(∃X ¬P(X)) [1] and ¬P(X) [2]. From [2] we get ∃X ¬P(X) by ∃I, which together with [1] yields F by FI; hence ¬¬P(X) by ¬I (discharging [2]) and P(X) by ¬E. Since X is not free in any remaining hypothesis, ∀I gives ∀X P(X), which together with ¬(∀X P(X)) yields F by FI. Now ¬I (discharging [1]) gives ¬¬(∃X ¬P(X)), and ¬E finally yields ∃X ¬P(X).
Now we reformulate the classical formulation of the calculus of natural deduction as a sequent calculus by lifting it to the “judgements level”, as we did for propositional logic. We only need to provide new quantifier rules.
Definition 16.3.3 (New Quantifier Rules). The inference rules of the first-order sequent calculus ND⊢1 consist of those from ND⊢0 plus the following quantifier rules:
∀I: from Γ⊢A derive Γ⊢∀X A, provided X is not free in Γ.
∀E: from Γ⊢∀X A derive Γ⊢[B/X](A).
∃I: from Γ⊢[B/X](A) derive Γ⊢∃X A.
∃E: from Γ⊢∃X A and Γ, [c/X](A)⊢C with c∈Σ^sk_0 new, derive Γ⊢C.
Definition 16.3.4 (First-Order Logic with Equality). We extend PL1 with a new
logical symbol for equality = ∈Σp2 and fix its semantics to I(=):={(x,x)|x∈Dι }.
We call the extended logic first-order logic with equality (PL1= )
We now extend natural deduction as well.
Definition 16.3.5. For the calculus of natural deduction with equality (ND^=_1) we add the following two rules to ND1 to deal with equality:
=I: derive A = A (from no assumptions).
=E: from A = B and C[A]_p derive [B/p]C.
where we write C[A]_p if the formula C has a subterm A at position p, and [B/p]C is the result of replacing that subterm with B.
In many ways equivalence behaves like equality, so we will use the following rules in ND1.
Definition 16.3.6. ⇔I is derivable and ⇔E is admissible in ND1:
⇔I: derive A ⇔ A (from no assumptions).
⇔E: from A ⇔ B and C[A]_p derive [B/p]C.
Again, we have two rules that follow the introduction/elimination pattern of natural deduction
calculi. To make sure that we understand the constructions here, let us get back to the
“replacement at position” operation used in the equality rules.
Positions in Formulae
Idea: Formulae are (naturally) trees, so we can use tree positions to talk about
subformulae
Definition 16.3.7. A position p is a tuple of natural numbers that in each node of an expression (tree) specifies into which child to descend. For an expression A we denote the subexpression at p with A|_p.
We will sometimes write an expression C as C[A]_p to indicate that C has the subexpression A at position p.
Definition 16.3.8. Let p be a position, then [A/p]C is the expression obtained from C by replacing the subexpression at p by A.
Example 16.3.9 (Schematically).
[Figure: the expression C as a tree with the subexpression A = C|_p at position p, next to the tree of [B/p]C with B at position p.]
The operation of replacing a subformula at position p is quite different from e.g. (first-order)
substitutions:
• We are replacing subformulae with subformulae instead of instantiating variables with terms.
• substitutions replace all occurrences of a variable in a formula, whereas formula replacement
only affects the (one) subformula at position p.
We conclude this section with an extended example: the proof of a classical mathematical result
in the natural deduction calculus with equality. This shows us that we can derive strong properties
about complex situations (here the real numbers; an uncountably infinite set of numbers).
ND^=_1 Example: √2 is Irrational
If we want to formalize this into ND1, we have to write down all the assertions in the proof steps in PL1 syntax and come up with justifications for them in terms of ND1 inference rules. The next two slides show such a proof, where we write prime(n) to denote that n is prime, use #(n) for the number of prime factors of a number n, and write irr(r) if r is irrational.
ND^=_1 Example: √2 is Irrational (the Proof)
Lines 6 and 9 are local hypotheses for the proof (they only have an implicit counterpart in the
inference rules as defined above). Finally we have abbreviated the arithmetic simplification of line
9 with the justification “arith” to avoid having to formalize elementary arithmetic.
ND^=_1 Example: √2 is Irrational (the Proof continued)
13        prime(2)                lemma
14  6,9   #(2q²) = #(q²) + 1      ⇒E(13, 12)
15  6,9   #(q²) = 2#(q)           ∀E²(2)
16  6,9   #(2q²) = 2#(q) + 1      =E(14, 15)
17        #(p²) = #(p²)           =I
18  6,9   #(2q²) = #(p²)          =E(17, 10)
19  6,9   2#(q) + 1 = #(p²)       =E(18, 16)
20  6,9   2#(q) + 1 = 2#(p)       =E(19, 11)
21  6,9   ¬(2#(q) + 1 = 2#(p))    ∀E²(1)
22  6,9   F                       FI(20, 21)
23  6     F                       ∃E⁹(22)
24        ¬¬irr(√2)               ¬I⁶(23)
25        irr(√2)                 ¬E(24)
We observe that the ND1 proof is much more detailed, and needs quite a few Lemmata about
# to go through. Furthermore, we have added a definition of irrationality (and treat definitional
equality via the equality rules). Apart from these artefacts of formalization, the two representations
of proofs correspond to each other very directly.
16.4 Conclusion
Summary (Predicate Logic)
Predicate logic allows to explicitly speak about objects and their properties. It is
thus a more natural and compact representation language than propositional logic;
it also enables us to speak about infinite sets of objects.
Logic has thousands of years of history. A major current application in AI is Se-
mantic Technology.
First-order predicate logic (PL1) allows universal and existential quantification over
objects.
A PL1 interpretation consists of a universe U and a function I mapping constant
symbols/predicate symbols/function symbols to elements/relations/functions on U .
Suggested Reading:
– A less formal account of what I cover in “Syntax” and “Semantics”. Contains different exam-
ples, and complementary explanations. Nice as additional background reading.
• Sections 8.3 and 8.4 provide additional material on using PL1, and on modeling in PL1, that I
don’t cover in this lecture. Nice reading, not required for exam.
• Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this
in AI-2, but provide one for the calculi introduced so far in??.
Chapter 17
Automated Theorem Proving in First-Order Logic
In this chapter, we take up the machine-oriented calculi for propositional logic from chapter 13 and extend them to the first-order case. While this has been relatively easy for the natural deduction calculus – we only had to introduce the notion of substitutions for the elimination rule for the universal quantifier – we have to work much more here to make the calculi effective for implementation.
Algorithm: Fully expand all possible tableaux, until no rule can be applied anymore.
Satisfiable, iff there are open branches. (open branches correspond to models)
Tableau calculi develop a formula in a tree-shaped arrangement that represents a case analysis on
when a formula can be made true (or false). Therefore the formulae are decorated with exponents
that hold the intended truth value.
On the left we have a refutation tableau that analyzes a negated formula (it is decorated with the
intended truth value F). Both branches contain an elementary contradiction ⊥.
On the right we have a model generation tableau, which analyzes a positive formula (it is decorated with the intended truth value T). This tableau uses the same rules as the refutation tableau, but makes a case analysis of when this formula can be satisfied. In this case we have a closed branch and an open one; the open branch corresponds to a model.
Now that we have seen the examples, we can write down the tableau rules formally.
The propositional tableau rules:
T0∧: from (A ∧ B)^T derive A^T and B^T (on the same branch)
T0∨: from (A ∧ B)^F split the branch into A^F and B^F
T0¬T: from (¬A)^T derive A^F        T0¬F: from (¬A)^F derive A^T
T0⊥: from A^α and A^β with α ≠ β derive ⊥
These inference rules, which act on tableaux, have to be read as follows: if the formulae above the line appear on a tableau branch, then the branch can be extended by the formulae or branches below the line. There are two rules for each primary connective, and a branch closing rule that adds the special symbol ⊥ (for unsatisfiability) to a branch.
We use the tableau rules with the convention that they are only applied, if they contribute new
material to the branch. This ensures termination of the tableau procedure for propositional logic
(every rule eliminates one primary connective).
Definition 17.1.5. We will call a closed tableau with the labeled formula Aα at the root a
tableau refutation for Aα .
The saturated tableau represents a full case analysis of what is necessary to give A the truth value
α; since all branches are closed (contain contradictions) this is impossible.
Definition 17.1.7. We will call a tableau refutation for AF a tableau proof for A, since it refutes
the possibility of finding a model where A evaluates to F. Thus A must evaluate to T in all
models, which is just our definition of validity.
Thus the tableau procedure can be used as a calculus for propositional logic. In contrast to the
propositional Hilbert calculus it does not prove a theorem A by deriving it from a set of axioms,
but it proves it by refuting its negation. Such calculi are called negative or test calculi. Generally
negative calculi have computational advantages over positive ones, since they have a built-in sense
of direction.
We have rules for all the necessary connectives (we restrict ourselves to ∧ and ¬, since the others
can be expressed in terms of these two via the propositional identities above. For instance, we can
write A ∨ B as ¬(¬A ∧ ¬B), and A ⇒ B as ¬A ∨ B,. . . .)
We will now extend the propositional tableau techniques to first-order logic. We only have to add
two new rules for the universal quantifiers (in positive and negative polarity).
The rule T1∀ operationalizes the intuition that a universally quantified formula is true, iff all of the instances of the scope are. To understand the T1∃ rule, we have to keep in mind that ∃X A abbreviates ¬(∀X ¬A), so that we have to read (∀X A)^F existentially – i.e. as (∃X ¬A)^T, stating that there is an object with property ¬A. In this situation, we can simply give this object a name: c, which we take from our (infinite) set of witness constants Σ^sk_0, which we have given ourselves expressly for this purpose when we defined first-order syntax. In other words ([c/X](¬A))^T = ([c/X](A))^F holds, and this is just the conclusion of the T1∃ rule.
Note that the T1∀ rule is computationally extremely inefficient: we have to guess an instance C∈wff_ι(Σ_ι, V_ι) for X (i.e. in a search setting, systematically consider all of them). This makes the rule infinitely branching.
In the next calculus we will try to remedy the computational inefficiency of the T1 ∀ rule. We do
this by delaying the choice in the universal rule.
Definition 17.1.9. The free variable tableau calculus (T1^f) extends T0 (the propositional tableau calculus) with the quantifier rules
T1^f∀: from (∀X A)^T derive ([Y/X](A))^T, where Y is a new metavariable,
T1^f∃: from (∀X A)^F derive ([f(Y_1, ..., Y_k)/X](A))^F, where f is a new Skolem constant of arity k and Y_1, ..., Y_k are the metavariables occurring in A,
and the generalized closing rule
T1^f⊥: from A^α and B^β with α ≠ β and σ(A) = σ(B) derive ⊥:σ, instantiating the whole tableau with σ.
Metavariables: Instead of guessing a concrete instance for the universally quantified variable
as in the T1 ∀ rule, T1f ∀ instantiates it with a new meta-variable Y , which will be instantiated by
need in the course of the derivation.
Skolem terms as witnesses: The introduction of metavariables makes it necessary to extend the treatment of witnesses in the existential rule. Intuitively, we cannot simply invent a new name, since the meaning of the body A may contain metavariables introduced by the T1^f∀ rule. As we do not know their values yet, the witness for the existential statement in the antecedent of the T1^f∃ rule needs to depend on them. So we represent the witness by a witness term, concretely one obtained by applying a Skolem function to the metavariables in A.
Instantiating Metavariables: Finally, the T1f⊥ rule completes the treatment of meta-variables,
it allows to instantiate the whole tableau in a way that the current branch closes. This leaves us
with the problem of finding substitutions that make two terms equal.
Tableau Reasons about Blocks
Example 17.1.11 (Reasoning about Blocks). Returning to slide 418:
You say: “All blocks are red”; “All blocks are on the table”; “A is a block”.
Can we prove red(A) from ∀x block(x) ⇒ red(x) and block(A)?
(∀X block(X) ⇒ red(X))^T
block(A)^T
red(A)^F
(block(Y) ⇒ red(Y))^T
block(Y)^F   |   red(A)^T
⊥:[A/Y]      |   ⊥
The left branch is closed by T1^f⊥ with the unifier [A/Y] (against block(A)^T); this instantiation turns the right branch's red(Y)^T into red(A)^T, which closes against red(A)^F.
Unification (Definitions)
Definition 17.1.12. For given terms A and B, unification is the problem of finding
a substitution σ, such that σ(A) = σ(B).
Notation: We write term pairs as A=?B e.g. f (X)=?f (g(Y )).
The idea behind a most general unifier is that all other unifiers can be obtained from it by (further)
instantiation. In an automated theorem proving setting, this means that using most general
unifiers is the least committed choice — any other choice of unifiers (that would be necessary for
completeness) can later be obtained by other substitutions.
Note that there is a subtlety in the definition of the ordering on substitutions: we only compare
on a subset of the variables. The reason for this is that we have defined substitutions to be total
on (the infinite set of) variables for flexibility, but in the applications (see the definition of most
general unifiers), we are only interested in a subset of variables: the ones that occur in the initial
problem formulation. Intuitively, we do not care what the unifiers do off that set. If we did
not have the restriction to the set W of variables, the ordering relation on substitutions would
become much too fine-grained to be useful (i.e. to guarantee unique most general unifiers in our
case).
Now that we have defined the problem, we can turn to the unification algorithm itself. We
will define it in a way that is very similar to logic programming: we first define a calculus that
generates “solved forms” (formulae from which we can read off the solution) and reason about
control later. In this case we will reason that control does not matter.
Unification Problems (=̂ Equational Systems)
Idea: Unification is equation solving.
Definition 17.1.16. We call a formula A_1 =? B_1 ∧ ... ∧ A_n =? B_n a unification problem iff A_i, B_i ∈ wff_ι(Σ_ι, V_ι).
Note: We consider unification problems as sets of equations (∧ is ACI), and
equations as two-element multisets (=? is C).
In principle, unification problems are sets of equations, which we write as conjunctions, since
all of them have to be solved for finding a unifier. Note that it is not a problem for the “logical
view” that the representation as conjunctions induces an order, since we know that conjunction
is associative, commutative and idempotent, i.e. that conjuncts do not have an intrinsic order or
multiplicity, if we consider two equational problems as equal, if they are equivalent as propositional
formulae. In the same way, we will abstract from the order in equations, since we know that the
equality relation is symmetric. Of course we would have to deal with this somehow in the implementation (typically, we would implement equational problems as lists of pairs), but that belongs
into the “control” aspect of the algorithm, which we are abstracting from at the moment.
It is essential to our “logical” analysis of the unification algorithm that we arrive at unification prob-
lems whose unifiers we can read off easily. Solved forms serve that need perfectly as Lemma 17.1.21
shows.
Given the idea that unification problems can be expressed as formulae, we can express the algo-
rithm in three simple rules that transform unification problems into solved forms (or unsolvable
ones).
Unification Algorithm
Definition 17.1.22. The inference system U consists of the following three transformation rules on unification problems:
Udec:  E ∧ f(A_1, ..., A_n) =? f(B_1, ..., B_n)  ⟶  E ∧ A_1 =? B_1 ∧ ... ∧ A_n =? B_n
Utriv: E ∧ A =? A  ⟶  E
Uelim: E ∧ X =? A  ⟶  [A/X](E) ∧ X =? A, provided X∉free(A) and X∈free(E)
The decomposition rule Udec is completely straightforward, but note that it transforms one unifi-
cation pair into multiple argument pairs; this is the reason, why we have to directly use unification
problems with multiple pairs in U.
Note furthermore, that we could have restricted the Utriv rule to variable-variable pairs, since
for any other pair, we can decompose until only variables are left. Here we observe, that constant-
constant pairs can be decomposed with the Udec rule in the somewhat degenerate case without
arguments.
Finally, we observe that the first of the two variable conditions in Uelim (the “occurs-in-check”) makes sure that we only apply the transformation to unifiable unification problems, whereas the second one is a termination condition that prevents the rule from being applied twice.
The notion of completeness and correctness is a bit different than that for calculi that we
compare to the entailment relation. We can think of the “logical system of unifiability” with
the model class of sets of substitutions, where a set satisfies an equational problem E, iff all of
its members are unifiers. This view induces the soundness and completeness notions presented
above.
The three meta-properties above are relatively trivial, but somewhat tedious to prove, so we leave
the proofs as an exercise to the reader.
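The transformation rules can be packaged into a short recursive unification function. The following Python sketch (mine, on the term representation from the earlier sketches, reusing subst and free) is the standard recursive formulation rather than literally the rule system U, but it computes the same most general unifiers; the occurs check corresponds to the first variable condition of Uelim. It returns a unifier as a dictionary, or None if there is none.

def unify(A, B, sigma=None):
    """Most general unifier of A and B extending sigma, or None if the terms are not unifiable."""
    sigma = dict(sigma or {})
    A, B = subst(sigma, A), subst(sigma, B)
    if A == B:                                            # Utriv: nothing to do
        return sigma
    if A[0] != "var" and B[0] == "var":
        A, B = B, A
    if A[0] == "var":
        x = A[1]
        if x in free(B):                                  # occurs check
            return None
        return {**{y: subst({x: B}, t) for y, t in sigma.items()}, x: B}   # Uelim
    if A[0] == B[0] == "fun" and A[1] == B[1] and len(A[2]) == len(B[2]):
        for a, b in zip(A[2], B[2]):                      # Udec: descend into argument pairs
            sigma = unify(a, b, sigma)
            if sigma is None:
                return None
        return sigma
    return None                                           # head symbol clash: unsolvable

# f(X) =? f(g(Y)):
X, Y = ("var", "X"), ("var", "Y")
print(unify(("fun", "f", [X]), ("fun", "f", [("fun", "g", [Y])])))   # {'X': g(Y)}
print(unify(X, ("fun", "f", [X])))                                   # None: occurs check fails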
We now fortify our intuition about the unification calculus by two examples. Note that we only
need to pursue one possible U derivation since we have confluence.
Unification Examples
Example 17.1.27. Two similar unification problems:
We will now convince ourselves that there cannot be any infinite sequences of transformations in
U. Termination is an important property for an algorithm.
The proof we present here is very typical for termination proofs. We map unification problems
into a partially ordered set ⟨S, ≺⟩ where we know that there cannot be any infinitely descending
sequences (we think of this as measuring the unification problems). Then we show that all trans-
formations in U strictly decrease the measure of the unification problems and argue that if there
were an infinite transformation in U, then there would be an infinite descending chain in S, which
contradicts our choice of ⟨S, ≺⟩.
The crucial step in coming up with such proofs is finding the right partially ordered set.
Fortunately, there are some tools we can make use of. We know that ⟨N, <⟩ is terminating, and
there are some ways of lifting component orderings to complex structures. For instance it is well-
known that the lexicographic ordering lifts a terminating ordering to a terminating ordering on
finite dimensional Cartesian spaces. We show a similar, but less known construction with multisets
for our proof.
Unification (Termination)
Definition 17.1.28. Let S and T be multisets and ≤ a partial ordering on S ∪ T .
Then we define S ≺m T, iff S = C ⊎ S′ and T = C ⊎ {t}, where s ≤ t for all s∈S′.
We call ≺m the multiset ordering induced by ≤.
Definition 17.1.29. We call a variable X solved in a unification problem E, iff E
contains a solved pair X =? A.
Lemma 17.1.30. If ≺ is linear/terminating on S, then ≺m is linear/terminating
on P(S).
Lemma 17.1.31. U is terminating. (any U-derivation is finite)
But it is very simple to create terminating calculi, e.g. by having no inference rules. So there
is one more step to go to turn the termination result into a decidability result: we must make sure
that we have enough inference rules so that any unification problem is transformed into solved
form if it is unifiable.
Proof:
1. U-irreducible unification problems can be reached in finite time by Lemma 17.1.31.
2. They are either solved or unsolvable by Lemma 17.1.33, so they provide the
answer.
Complexity of Unification
Observation: Naive implementations of unification are exponential in time and
space.
Indeed, the only way to escape this combinatorial explosion is to find representations of substitu-
tions that are more space efficient.
[figure: the terms s3 and t3 and the instance σ3(t3) drawn as trees over the function symbol f and the variables x0, . . . , x3, illustrating the exponential size of σn(tn)]
If we look at the unification algorithm from Definition 17.1.22 and the considerations in the
termination proof (Lemma 17.1.31) with a particular focus on the role of copying, we easily find the
culprit for the exponential blowup: Uelim, which applies solved pairs as substitutions.
We will now turn the ideas we have developed in the last couple of slides into a usable func-
tional algorithm. The starting point is treating terms as DAGs. Then we try to conduct the
transformation into solved form without adding new nodes.
Unification by DAG-chase
Idea: Extend the input DAGs by edges that represent unifiers.
We write n.a, if a is the symbol of node n.
Algorithm dag−unify
Observation 17.1.40. dag−unify uses linear space, since no new nodes are created,
and at most one link per variable.
Problem: dag−unify still uses exponential time.
Example 17.1.41.
Consider the terms f(s_n, f(t′_n, x_n)) and f(t_n, f(s′_n, y_n)), where s′_n = [y_i/x_i](s_n) and
t′_n = [y_i/x_i](t_n).
dag−unify needs exponentially many recursive calls to unify the nodes x_n and y_n.
(they are unified after n calls, but checking needs the time)
Algorithm uf−unify
This only needs linearly many recursive calls as it directly returns with true or makes
a node inaccessible for find.
Linearly many calls to linear procedures give quadratic running time.
Remark: There are versions of uf−unify that are linear in time and space, but for
most purposes, our algorithm suffices.
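The slides with the dag−unify and uf−unify algorithms themselves are not reproduced above, so the following Python sketch only illustrates the union-find idea behind uf−unify: once two nodes have been unified, find returns a single representative, so repeated comparisons of shared subterms return immediately instead of re-traversing them. The node representation and names are assumptions of this sketch, and the occurs check is omitted (it would be a separate acyclicity pass over the resulting DAG).

# Sketch of unification on term DAGs with a union-find structure.
class Node:
    def __init__(self, symbol=None, args=None):
        self.symbol = symbol        # None for variable nodes
        self.args = args or []      # successor nodes in the term DAG
        self.parent = self          # union-find pointer

def find(n):
    while n.parent is not n:
        n.parent = n.parent.parent  # path compression
        n = n.parent
    return n

def uf_unify(s, t):
    """Destructively unify the DAG nodes s and t; return True or False."""
    s, t = find(s), find(t)
    if s is t:                      # already unified: return immediately
        return True
    if s.symbol is None:            # s is a variable: merge it into t
        s.parent = t
        return True
    if t.symbol is None:            # t is a variable: merge it into s
        t.parent = s
        return True
    if s.symbol != t.symbol or len(s.args) != len(t.args):
        return False                # head clash
    t.parent = s                    # merge before recursing: the key to efficiency
    return all(uf_unify(a, b) for a, b in zip(s.args, t.args))

# e.g. unifying x with f(y) and then y with a:
# x, y, a = Node(), Node(), Node("a")
# uf_unify(x, Node("f", [y])); uf_unify(y, a)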
After we have used up p(y) by applying [a/y] in T1f⊥, we have to get a new instance p(z) of the
universally quantified formula; the remaining branch then closes with ⊥ : [b/z].
Proof sketch: All T1f rules reduce the number of connectives and negative ∀ or the
multiplicity of positive ∀.
Theorem 17.1.46. T1f is only complete with unbounded multiplicities.
Proof sketch: Replace p(a) ∨ p(b) with p(a1 ) ∨ . . . ∨ p(an ) in Example 17.1.43.
Remark: Otherwise validity in PL1 would be decidable.
Implementation: We need an iterative multiplicity deepening process.
The other thing we need to realize is that there may be multiple ways we can use T1f⊥ to close a
branch in a tableau, and – as T1f⊥ instantiates the whole tableau and not just the branch itself –
this choice matters.
Treating T1f⊥
Example 17.1.47. Choosing which pair to cut on matters – this tableau does not close!

(∃x ((p(a) ∧ p(b) ⇒ p(x)) ∧ (q(b) ⇒ q(x))))^F
((p(a) ∧ p(b) ⇒ p(y)) ∧ (q(b) ⇒ q(y)))^F
left branch: (p(a) ∧ p(b) ⇒ p(y))^F, p(a)^T, p(b)^T, p(y)^F, closed with ⊥ : [a/y]
right branch: (q(b) ⇒ q(y))^F, q(b)^T, q(y)^F

Closing the left branch with [a/y] instantiates y with a in the whole tableau, so the right branch
(q(b)^T, q(a)^F) can no longer be closed; we should have used [b/y] instead.
The method of spanning matings follows the intuition that if we do not have good information
on how to decide for a pair of opposite literals on a branch to use in T1f⊥, we delay the choice by
initially disregarding the rule altogether during saturation and then – in a later phase– looking
for a configuration of cuts that have a joint overall unifier. The big advantage of this is that we
only need to know that one exists, we do not need to compute or apply it, which would lead to
exponential blow-up as we have seen above.
Observation 17.1.48. T1f without T1f⊥ is terminating and confluent for given
multiplicities.
Idea: Saturate without T1f⊥ and treat all cuts at the same time (later).
Definition 17.1.49.
Let T be a T1f tableau, then we call a unification problem E := A1 =? B1 ∧ . . . ∧ An =? Bn
a mating for T , iff Ai^T and Bi^F occur in the same branch in T .
We say that E is a spanning mating, if E is unifiable and every branch B of T
contains Ai^T and Bi^F for some i.
Theorem 17.1.50. A T1f -tableau with a spanning mating induces a closed T1
tableau.
Proof sketch: Just apply the unifier of the spanning mating.
Excursion: Now that we understand basic unification theory, we can come to the meta-theoretical
properties of the tableau calculus. We delegate this discussion to??.
Excursion: Again, we relegate the meta-theoretical properties of the first-order resolution calculus
to??.
Remark: Modern resolution theorem provers prove this in less than 50ms.
Problem: That is only true if we give the theorem prover exactly the right
laws and background knowledge. If we give it all of them, it drowns in the combinatorial
explosion.
West is an American:
Clause: ami(West)
The country Nono is an enemy of America:
enmy(NN, USA)
Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this in
the course, but provide one for the calculi introduced so far in??.
Definition 17.3.4. A Horn clause is a clause with at most one positive literal.
Recall: Backchaining as search:
state = tuple of goals; goal state = empty list (of goals).
next(⟨G, R1 , . . ., Rl ⟩):=⟨σ(B 1 ), . . ., σ(B m ), σ(R1 ), . . ., σ(Rl )⟩ if there is a rule
H:−B 1 ,. . ., B m . and a substitution σ with σ(H) = σ(G).
Note: Backchaining becomes resolution:
from P^T ∨ A and P^F ∨ B infer A ∨ B
i.e. positive, unit-resulting hyperresolution (PURR).
This observation helps us understand ProLog better, and use implementation techniques from
theorem proving.
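To make the next() definition above concrete, here is a Python sketch of backchaining over Horn clauses, reusing the unify/apply_subst sketch from the unification section; rules H:−B1,...,Bm become pairs (head, [body literals]). All names are our own, not ProLog's actual machinery.

# Backchaining as search in the space of goal tuples (cf. next() above).
import itertools
_fresh = itertools.count()

def rename(t, mapping):
    # rename rule variables apart before each use
    if isinstance(t, str):
        return mapping.setdefault(t, t + "_" + str(next(_fresh)))
    return (t[0],) + tuple(rename(s, mapping) for s in t[1:])

def backchain(goals, rules, answer=None):
    """Yield answer substitutions; may not terminate for recursive programs."""
    answer = answer or {}
    if not goals:                                   # goal state: no goals left
        yield answer
        return
    goal, rest = goals[0], goals[1:]
    for head, body in rules:
        mapping = {}
        head = rename(head, mapping)
        body = [rename(b, mapping) for b in body]
        sigma = unify([(apply_subst(answer, goal), head)])
        if sigma is None:
            continue
        theta = {x: apply_subst(sigma, t) for x, t in answer.items()}
        theta.update(sigma)
        new_goals = [apply_subst(theta, g) for g in body + list(rest)]
        yield from backchain(new_goals, rules, theta)

# e.g. rules = [(("nat", ("zero",)), []),
#               (("nat", ("s", "X")), [("nat", "X")])]
# next(backchain([("nat", ("s", ("zero",)))], rules)) succeeds.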
To gain an intuition for this quite abstract definition let us consider a concrete knowledge base
about cars. Instead of writing down everything we know about cars, we only write down that cars
are motor vehicles with four wheels and that a particular object c has a motor and four wheels. We
can see that the fact that c is a car can be derived from this. Given our definition of a knowledge
base as the deductive closure of the facts and rule explicitly written down, the assertion that c is
a car is in the induced knowledge base, which is what we are after.
In this very simple example car(c) is about the only fact we can derive, but in general, knowledge
bases can be infinite (we will see examples below).
e.g. greek(sokrates),greek(perikles)
Question: Are there fallible greeks?
Indefinite answer: Yes, Perikles or Sokrates
Warning: how about Sokrates and Perikles?
e.g. greek(sokrates),roman(sokrates):−.
Query: Are there fallible greeks?
Answer: Yes, Sokrates, if he is not a roman
Is this abduction?????
299
300 CHAPTER 18. KNOWLEDGE REPRESENTATION AND THE SEMANTIC WEB
According to an influential view of [PRR97], knowledge appears in layers. Starting with a character
set that defines a set of glyphs, we can add syntax that turns mere strings into data. Adding context
information gives information, and finally, relating the information to other information allows
us to draw conclusions, turning information into knowledge.
Note that we already have aspects of representation and function in the diagram at the top of the
slide. In this, the additional functionality added in the successive layers gives the representations
more and more functions, until we reach the knowledge level, where the function is given by inferencing.
In the second example, we can see that representations determine possible functions.
Let us now strengthen our intuition about knowledge by contrasting knowledge representations
with “regular” data structures in computation.
Answer:
No good reason other than AI practice, with the intuition that
data is simple and general (supports many algorithms)
knowledge is complex (has distinguished process model)
As knowledge is such a central notion in artificial intelligence, it is not surprising that there are
multiple approaches to dealing with it. We will only deal with the first one and leave the others
to self-study.
When assessing the relative strengths of the respective approaches, we should evaluate them with
respect to a pre-determined set of criteria.
KR Approaches/Evaluation Criteria
Definition 18.1.1. The evaluation criteria for knowledge representation approaches
are:
Expressive adequacy: What can be represented, what distinctions are supported.
Reasoning efficiency: Can the representation support processing that generates
results in acceptable speed?
Primitives: What are the primitive elements of representation, are they intuitive,
cognitively adequate?
Meta representation: Knowledge about knowledge
Completeness: The problems of reasoning with knowledge that is known to be
incomplete.
Even though the network in Example 18.1.3 is very intuitive (we immediately understand the
concepts depicted), it is unclear how we (and more importantly a machine that does not asso-
ciate meaning with the labels of the nodes and edges) can draw inferences from the “knowledge”
represented.
Idea: Links labeled with “isa” and “inst” are special: they propagate properties
encoded by other links.
Definition 18.1.6. We call links labeled by
“isa” an inclusion or isa link (inclusion of concepts)
“inst” instance or inst link (concept membership)
We now make the idea of “propagating properties” rigorous by defining the notion of derived
relations, i.e. the relations that are left implicit in the network, but can be added without changing
its meaning.
[semantic network diagram: nodes bird, robin, Jack, wings, Person, Mary, and John, connected by isa, inst, has_part, owner_of, and loves links]
Slogan: Get out more knowledge from a semantic network than you put in.
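As an illustration of the slogan, here is a small Python sketch that computes derived relations for a network given as labeled triples. The propagation rules are the intuitive reading sketched above (isa is transitive, and properties attached to a concept are inherited by sub-concepts and instances); note that, unlike Definition 18.1.7 as discussed right below, this sketch also propagates concept membership along isa links, so it does derive that Jack is a bird. All names are our own.

# Deriving implicit links in a semantic network of (source, label, target) triples.
def derived(triples):
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (a, r1, b) in facts:
            for (c, r2, d) in facts:
                if b != c:
                    continue
                if r1 == "isa" and r2 == "isa":
                    new.add((a, "isa", d))        # isa is transitive
                elif r1 == "inst" and r2 == "isa":
                    new.add((a, "inst", d))       # membership propagates upwards
                elif r1 in ("isa", "inst") and r2 not in ("isa", "inst"):
                    new.add((a, r2, d))           # inherit other links downwards
        if not new <= facts:
            facts |= new
            changed = True
    return facts

network = {("robin", "isa", "bird"), ("Jack", "inst", "robin"),
           ("bird", "has_part", "wings")}
assert ("Jack", "has_part", "wings") in derived(network)   # knowledge we got out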
Note that Definition 18.1.7 does not quite allow us to derive that Jack is a bird (did you spot that
“isa” is not a relation that can be inferred?), even though we know it is true in the world. This
shows us that inference in semantic networks has to be very carefully defined and may not be
“complete”, i.e. there are things that are true in the real world that our inference procedure does
not capture.
Dually, if we are not careful, then the inference procedure might derive properties that are not
true in the real world even if all the properties explicitly put into the network are. We call such
an inference procedure unsound or incorrect.
These are two general phenomena we have to keep an eye on.
Another problem is that semantic nets (e.g. in Example 18.1.3) confuse two kinds of concepts:
individuals (represented by proper names like John and Jack) and concepts (nouns like robin and
bird). Even though the isa and inst links already acknowledge this distinction, the “has_part” and
“loves” relations are at different levels entirely, but not distinguished in the networks.
[semantic network with a TBox and an ABox: the TBox contains the concepts animal, amoeba, higher animal, tiger, and elephant with links such as “animal can move”, “higher animal has_part legs/head”, “tiger pattern striped”, and “elephant color gray”; the ABox contains the objects Roy, Rex, and Clyde as instances with an “eat” relation between them]
In particular we have objects “Rex”, “Roy”, and “Clyde”, which have (derived) rela-
tions (e.g. Clyde is gray).
But there are severe shortcomings of semantic networks: the suggestive shape and node names
give (humans) a false sense of meaning, and the inference rules are only given in the process model
(the implementation of the semantic network processing system).
This makes it very difficult to assess the strength of the inference system and make assertions
e.g. about completeness.
Example 18.1.12. Consider a robin that has lost its wings in an accident:
[two networks: on the left, jack inst robin, robin isa bird, bird has_part wings; on the right, joe inst robin, where a “cancel” link blocks the inherited has_part wings]
“Cancel-links” have been proposed, but their status and process model are debatable.
To alleviate the perceived drawbacks of semantic networks, we can contemplate another notation
that is more linear and thus more easily implemented: function/argument notation.
Evaluation:
+ linear notation (equivalent, but better to implement on a computer)
+ easy to give process model by deduction (e.g. in ProLog)
– worse locality properties (networks are associative)
Indeed, function/argument notation is the most immediate way one would naturally represent
semantic networks for implementation.
This notation has been also characterized as subject/predicate/object triples, alluding to simple
(English) sentences. This will play a role in the “semantic web” later.
Building on the function/argument notation from above, we can now give a formal semantics for
semantic networks: we translate them into first-order logic and use the semantics of that.
Indeed, the semantics induced by the translation to first-order logic gives the intuitive meaning
to the semantic networks. Note that this holds only for the features of semantic networks that
are representable in this way, e.g. the “cancel links” shown above are not (and that is a feature,
not a bug).
But even more importantly, the translation to first-order logic gives a first process model: we
can use first-order inference to compute the set of inferences that can be drawn from a semantic
network.
Humans understand the text and combine the information to get the answer. Ma-
chines need more than just text ; semantic web technology.
The term “semantic web” was coined by Tim Berners-Lee in analogy to semantic networks, only
applied to the world wide web. And as for semantic networks, we want inference processes
that allow us to recover information that is not explicitly represented from the network (here the
world wide web).
To see what problems have to be solved to arrive at the semantic web, we will now look at a
concrete example about the “semantics” in web pages. Here is one that looks typical enough.
WWW2002
The eleventh International World Wide Web Conference
Sheraton Waikiki Hotel
Honolulu, Hawaii, USA
On the 7th May Honolulu will provide the backdrop of the eleventh
International World Wide Web Conference.
Speakers confirmed
Tim Berners-Lee: Tim is the well known inventor of the Web,
Ian Foster: Ian is the pioneer of the Grid, the next generation internet.
But as for semantic networks, what you as a human can see (“understand” really) is deceptive, so
let us obfuscate the document to confuse your “semantic processor”. This gives an impression of
what the computer “sees”.
R⌉}⟩∫⊔⌉∇⌉⌈√⊣∇⊔⟩⌋⟩√⊣\⊔∫⌋≀⇕⟩\}{∇≀⇕
A⊓∫⊔∇⊣↕⟩⊣⇔C⊣\⊣⌈⊣⇔C⟨⟩↕⌉D⌉\⇕⊣∇∥⇔F∇⊣\⌋⌉⇔G⌉∇⇕⊣\†⇔G⟨⊣\⊣⇔H≀\}K≀\}⇔I\⌈⟩⊣⇔
I∇⌉↕⊣\⌈⇔I⊔⊣↕†⇔J⊣√⊣\⇔M⊣↕⊔⊣⇔N⌉⊒Z⌉⊣↕⊣\⌈⇔T⟨⌉N⌉⊔⟨⌉∇↕⊣\⌈∫⇔N≀∇⊒⊣†⇔
S⟩\}⊣√≀∇⌉⇔S⊒⟩⊔‡⌉∇↕⊣\⌈⇔⊔⟨⌉U\⟩⊔⌉⌈K⟩\}⌈≀⇕⇔⊔⟨⌉U\⟩⊔⌉⌈S⊔⊣⊔⌉∫⇔V⟩⌉⊔\⊣⇕⇔Z⊣⟩∇⌉
O\⊔⟨⌉7⊔⟨M⊣†H≀\≀↕⊓↕⊓⊒⟩↕↕√∇≀⊑⟩⌈⌉⊔⟨⌉⌊⊣⌋∥⌈∇≀√≀{⊔⟨⌉⌉↕⌉⊑⌉\⊔⟨
I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉↙
S√⌉⊣∥⌉∇∫⌋≀\{⟩∇⇕⌉⌈
T⟩⇕B⌉∇\⌉∇∫↖L⌉⌉¬T⟩⇕⟩∫⊔⟨⌉⊒⌉↕↕∥\≀⊒\⟩\⊑⌉\⊔≀∇≀{⊔⟨⌉W⌉⌊⇔
I⊣\F≀∫⊔⌉∇¬I⊣\⟩∫⊔⟨⌉√⟩≀\⌉⌉∇≀{⊔⟨⌉G∇⟩⌈⇔⊔⟨⌉\⌉§⊔}⌉\⌉∇⊣⊔⟩≀\⟩\⊔⌉∇\⌉⊔↙
Obviously, there is not much the computer understands, and as a consequence, there is not a lot
the computer can support the reader with. So we have to “help” the computer by providing some
meaning. Conventional wisdom is that we add some semantic/functional markup. Here we pick
XML without loss of generality, and characterize some fragments of text e.g. as dates.
ℜ⊔⟩⊔↕⌉⊤WWW∈′′∈
T⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉ℜ∝⊔⟩⊔↕⌉⊤
ℜ√↕⊣⌋⌉⊤S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USAℜ∝√↕⊣⌋⌉⊤
ℜ⌈⊣⊔⌉⊤7↖∞∞M⊣†∈′′∈ℜ∝⌈⊣⊔⌉⊤
parse 7∞∞M⊣†∈′′∈ as the date 7–11 May 2002 and add this to the user’s calendar,
parse S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USA as a destination and find flights.
But: do not be deceived by your ability to understand English!
To understand what a machine can understand we have to obfuscate the markup as well, since it
does not carry any intrinsic meaning to the machine either.
<√↕⊣⌋⌉>S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USA</√↕⊣⌋⌉>
<⌈⊣⊔⌉>7↖∞∞M⊣†∈′′∈</⌈⊣⊔⌉>
<√⊣∇⊔⟩⌋⟩√⊣\⊔∫ >R⌉}⟩∫⊔⌉∇⌉⌈√⊣∇⊔⟩⌋⟩√⊣\⊔∫⌋≀⇕⟩\}{∇≀⇕
A⊓∫⊔∇⊣↕⟩⊣⇔C⊣\⊣⌈⊣⇔C⟨⟩↕⌉D⌉\⇕⊣∇∥⇔F∇⊣\⌋⌉⇔G⌉∇⇕⊣\†⇔G⟨⊣\⊣⇔H≀\}K≀\}⇔I\⌈⟩⊣⇔
I∇⌉↕⊣\⌈⇔I⊔⊣↕†⇔J⊣√⊣\⇔M⊣↕⊔⊣⇔N⌉⊒Z⌉⊣↕⊣\⌈⇔T⟨⌉N⌉⊔⟨⌉∇↕⊣\⌈∫⇔N≀∇⊒⊣†⇔
S⟩\}⊣√≀∇⌉⇔S⊒⟩⊔‡⌉∇↕⊣\⌈⇔⊔⟨⌉U\⟩⊔⌉⌈K⟩\}⌈≀⇕⇔⊔⟨⌉U\⟩⊔⌉⌈S⊔⊣⊔⌉∫⇔V⟩⌉⊔\⊣⇕⇔Z⊣⟩∇⌉
</√⊣∇⊔⟩⌋⟩√⊣\⊔∫ >
<⟩\⊔∇≀⌈⊓⌋⊔⟩≀\>O\⊔⟨⌉7⊔⟨M⊣†H≀\≀↕⊓↕⊓⊒⟩↕↕√∇≀⊑⟩⌈⌉⊔⟨⌉⌊⊣⌋∥⌈∇≀√≀{⊔⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣↖
⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉↙</⟩\⊔∇≀⌈⊓⌋⊔⟩≀\>
<√∇≀}∇⊣⇕>S√⌉⊣∥⌉∇∫⌋≀\{⟩∇⇕⌉⌈
<∫√⌉⊣∥⌉∇>T⟩⇕B⌉∇\⌉∇∫↖L⌉⌉¬T⟩⇕⟩∫⊔⟨⌉⊒⌉↕↕∥\≀⊒\⟩\⊑⌉\⊔≀∇≀{⊔⟨⌉W⌉⌊</∫√⌉⊣∥⌉∇>
<∫√⌉⊣∥⌉∇>I⊣\F≀∫⊔⌉∇¬I⊣\⟩∫⊔⟨⌉√⟩≀\⌉⌉∇≀{⊔⟨⌉G∇⟩⌈⇔⊔⟨⌉\⌉§⊔}⌉\⌉∇⊣⊔⟩≀\⟩\⊔⌉∇\⌉⊔<∫√⌉⊣∥⌉∇>
</√∇≀}∇⊣⇕>
So we have not really gained much with the markup either; we still have to give meaning to the
markup itself. This is where semantic web techniques come into play.
To understand how we can make the web more semantic, let us first take stock of the current status
of (markup on) the web. It is well-known that the world wide web is a hypertext, where multimedia
documents (text, images, videos, etc. and their fragments) are connected by hyperlinks. As we
have seen, all of these are largely opaque (non-understandable) to machines, so we end up with the following
situation (from the viewpoint of a machine).
Essentially, to make the web more machine-processable, we need to classify the resources by the
concepts they represent and give the links a meaning in a way that allows us to do inference with them.
The ideas presented here gave rise to a set of technologies jointly called the “semantic web”, which
we will now summarize before we return to our logical investigations of knowledge representation
techniques.
Example 18.1.24 (A script: getting your hair cut at a beauty parlor).
tell receptionist you’re here
beautician cuts hair
pay
if happy: big tip; if unhappy: small tip
Scripts provide: props and actors as “script variables”, events in a (generalized) sequence,
material for anaphora and bridging references, and a default common ground to fill missing
material into situations.
But of course logic-based approaches have big drawbacks as well. The first is that we have to obtain
the symbolic representations of knowledge to do anything – a non-trivial challenge, since most
knowledge does not exist in this form in the wild; to obtain it, some agent has to experience the
world, pass it through its cognitive apparatus, conceptualize the phenomena involved, systematize
them sufficiently to form symbols, and then represent those in the respective formalism at hand.
The second drawback is that the process models induced by logic-based approaches (inference
with calculi) are quite intractable. We will see that all inferences can be played back to satisfiability
tests in the underlying logical system, which are exponential at best, and undecidable or even
incomplete at worst.
Therefore a major thrust in logic-based knowledge representation is to investigate logical sys-
tems that are expressive enough to be able to represent most knowledge, but still have a decidable
– and maybe even tractable in practice – satisfiability problem. Such logics are called “description
logics”. We will study the basics of such logical systems and their inference procedures in the
following.
L ::= C | ⊤ | ⊥ | ¬L | L ⊓ L | L ⊔ L | L ⊑ L | L ≡ L
Note: ⟨PL0DL , S, [ ·]]⟩, where S is the class of possible domains forms a logical
system.
The main use of the set-theoretic semantics for PL0 is that we can use it to give meaning to concept
axioms, which we use to describe the “world”.
Concept Axioms
Observation: Set-theoretic semantics of ‘true’ and ‘false’: ⊤ := φ ⊔ ¬φ, ⊥ := φ ⊓ ¬φ.
Idea: Use logical axioms to describe the world (Axioms restrict the class of
admissible domain structures)
[Venn diagram: the set of children as the union of the disjoint sets of sons and daughters]
Concept axioms are used to restrict the set of admissible domains to the intended ones. In our
situation, we require them to be true – as usual – which here means that they denote the whole
domain D.
Let us fortify our intuition about concept axioms with a simple example about the sibling relation.
We give four concept axioms and study their effect on the admissible models by looking at the
respective Venn diagrams. In the end we see that in all admissible models, the denotations of the
concepts son and daughter are disjoint, and child is the union of the two – just as intended.
Axioms and their semantics:
son ⊑ child        is true   iff [[¬son]] ∪ [[child]] = D        iff [[son]] ⊆ [[child]]
daughter ⊑ child   is true   iff [[¬daughter]] ∪ [[child]] = D   iff [[daughter]] ⊆ [[child]]
[Venn diagrams: the sets of sons and daughters contained in the set of children]
The set-theoretic semantics introduced above is compatible with the regular semantics of proposi-
tional logic, therefore we have the same propositional identities. Their validity can be established
directly from the settings in Definition 18.2.2.
Propositional Identities
Name       for ⊓                                  for ⊔
Idempot.   φ ⊓ φ = φ                              φ ⊔ φ = φ
Identity   φ ⊓ ⊤ = φ                              φ ⊔ ⊥ = φ
Absorpt.   φ ⊔ ⊤ = ⊤                              φ ⊓ ⊥ = ⊥
Commut.    φ ⊓ ψ = ψ ⊓ φ                          φ ⊔ ψ = ψ ⊔ φ
Assoc.     φ ⊓ (ψ ⊓ θ) = (φ ⊓ ψ) ⊓ θ              φ ⊔ (ψ ⊔ θ) = (φ ⊔ ψ) ⊔ θ
Distrib.   φ ⊓ (ψ ⊔ θ) = (φ ⊓ ψ) ⊔ (φ ⊓ θ)        φ ⊔ (ψ ⊓ θ) = (φ ⊔ ψ) ⊓ (φ ⊔ θ)
Absorpt.   φ ⊓ (φ ⊔ θ) = φ                        φ ⊔ (φ ⊓ θ) = φ
Morgan     ¬(φ ⊓ ψ) = ¬φ ⊔ ¬ψ                     ¬(φ ⊔ ψ) = ¬φ ⊓ ¬ψ
dneg       ¬¬φ = φ
There is another way we can approach the set description interpretation of propositional logic: by
translation into a logic that can express knowledge about sets – first-order logic.
Definition                                    Comment
p^fo(x) := p(x)
(¬A)^fo(x) := ¬(A^fo(x))
(A ⊓ B)^fo(x) := A^fo(x) ∧ B^fo(x)             ∧ vs. ⊓
(A ⊔ B)^fo(x) := A^fo(x) ∨ B^fo(x)             ∨ vs. ⊔
(A ⊑ B)^fo(x) := A^fo(x) ⇒ B^fo(x)             ⇒ vs. ⊑
(A = B)^fo(x) := A^fo(x) ⇔ B^fo(x)             ⇔ vs. =
A^fo := ∀x A^fo(x)                             for formulae
Translation Examples
Example 18.2.8. We translate the concept axioms from Example 18.2.6 to fortify
our intuition:
(son ⊑ child)^fo = ∀x son(x) ⇒ child(x)
(daughter ⊑ child)^fo = ∀x daughter(x) ⇒ child(x)
(¬(son ⊓ daughter))^fo = ∀x ¬(son(x) ∧ daughter(x))
(child ⊑ son ⊔ daughter)^fo = ∀x child(x) ⇒ (son(x) ∨ daughter(x))
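The translation ·^fo is simple enough to implement directly; the following Python sketch produces the first-order formulae (as strings) from the table above. The concept representation and names are assumptions of this sketch.

# The fo-translation of PL0_DL concepts and axioms into first-order formulae.
# Concepts are nested tuples ("not", C), ("and", C, D), ("or", C, D),
# ("sub", C, D), ("eq", C, D), or atomic concept names (strings).
def fo(c, x="x"):
    if isinstance(c, str):                # atomic concept p becomes p(x)
        return c + "(" + x + ")"
    op = c[0]
    if op == "not":
        return "¬" + fo(c[1], x)
    conn = {"and": "∧", "or": "∨", "sub": "⇒", "eq": "⇔"}[op]
    return "(" + fo(c[1], x) + " " + conn + " " + fo(c[2], x) + ")"

def fo_formula(c):
    return "∀x " + fo(c)                  # formulae are universally closed

print(fo_formula(("sub", "son", "child")))
# ∀x (son(x) ⇒ child(x))
print(fo_formula(("sub", "child", ("or", "son", "daughter"))))
# ∀x (child(x) ⇒ (son(x) ∨ daughter(x)))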
As we will see, the situation for PL0DL is typical for formal ontologies (even though it only offers
concepts), so we state the general description logic paradigm for ontologies. The important idea
is that having a formal system as an ontology format allows us to capture, study, and implement
ontological inference.
For convenience we add concept definitions as a mechanism for defining new concepts from old
ones. The so-defined concepts inherit the properties from the concepts they are defined from.
As PL0DL does not offer any guidance on this, we will leave the discussion of ABoxes to subsec-
tion 18.3.3 when we have introduced our first proper description logic ALC.
Consistency Test
Example 18.2.24 (T-Box).
Even though consistency in our example seems trivial, large ontologies can make machine support
necessary. This is even more true for ontologies that change over time. Say that an ontology
initially has the concept definitions woman=person⊓long_hair and man=person⊓bearded, and then
is modernized to a more biologically correct state. In the initial version the concept hermaphrodite
is consistent, but becomes inconsistent after the renovation; the authors of the renovation should
be made aware of this by the system.
The subsumption test determines whether the sets denoted by two concepts are in a subset relation.
The main justification for this is that humans tend to be aware of concept subsumption and to
think in taxonomic hierarchies. To cater to this, the subsumption test is useful.
Subsumption Test
Example 18.2.25. in this case trivial
The good news is that we can reduce the subsumption test to the consistency test, so we can
re-use our existing implementation.
The main user-visible service of the subsumption test is to compute the actual taxonomy induced
by an ontology.
Classification
The subsumption relation among all concepts (subsumption graph)
Visualization of the subsumption graph for inspection (plausibility)
Definition 18.2.27. Classification is the computation of the subsumption graph.
If we take stock of what we have developed so far, then we can see PL0DL as a rational recon-
struction of semantic networks restricted to the “isa” relation. We relegate the “instance” relation
to subsection 18.3.3.
This reconstruction can now be used as a basis on which we can extend the expressivity and
inference procedures without running into problems.
ALC extends the concept operators of PL0DL with binary relations (called “roles” in ALC). This
gives ALC the expressive power we had for the basic semantic networks from ??.
Syntax of ALC
Example 18.3.4. person, woman, man, mother, professor, student, car, BMW,
computer, computer program, heart attack risk, furniture, table, leg of a chair, . . .
Definition 18.3.5. Roles name binary relations (like in PL1 )
Example 18.3.6. has_child, has_son, has_daughter, loves, hates, gives_course,
executes_computer_program, has_leg_of_table, has_wheel, has_motor, . . .
ALC restricts quantification to range over the individuals reachable as role successors. The
distinction between universal and existential quantifiers clarifies an implicit ambiguity in semantic
networks.
Definition 18.3.7 (Grammar). FALC ::= C | ⊤ | ⊥ | ¬FALC | FALC ⊓ FALC | FALC ⊔ FALC |
∃R FALC | ∀R FALC
Example 18.3.8.
Example 18.3.9. car ⊓ ∃has_part ¬(∃made_in EU) (cars that have at least one part
that has not been made in the EU)
As before we allow concept definitions so that we can express new concepts from old ones, and
obtain more concise descriptions.
Definition                                                            recursive?
man = person ⊓ ∃has_chrom Y_chrom                                     -
woman = person ⊓ ∀has_chrom ¬Y_chrom                                  -
mother = woman ⊓ ∃has_child person                                    -
father = man ⊓ ∃has_child person                                      -
grandparent = person ⊓ ∃has_child (mother ⊔ father)                   -
german = person ⊓ ∃has_parents german                                 +
number_list = empty_list ⊔ (∃is_first number ⊓ ∃is_rest number_list)  +
As before, we can normalize a TBox by definition expansion if it is acyclic. With the introduction
of roles and quantification, concept definitions in ALC have a more “interesting” way to be cyclic
as Observation 18.3.19 shows.
Example 18.3.20.
Now that we have motivated and fixed the syntax of ALC, we will give it a formal semantics.
The semantics of ALC is an extension of the set-theoretic semantics for PL0 , thus the interpretation
[[·]] assigns subsets of the domain to concepts and binary relations over the domain to roles.
Semantics of ALC
ALC semantics is an extension of the set-semantics of propositional logic.
Definition 18.3.21. A model for ALC is a pair ⟨D, [[·]]⟩, where D is a non-empty
set called the domain and [[·]] a mapping called the interpretation, such that
We can now use the ALC identities above to establish a useful normal form for ALC. This will
play a role in the inference procedures we study next.
The following identities will be useful later on. They can be proven directly from the settings in
Definition 18.3.21; we carry this out for one of them below.
ALC Identities
1. ¬(∃R φ) = ∀R ¬φ
2. ∀R (φ ⊓ ψ) = ∀R φ ⊓ ∀R ψ
3. ¬(∀R φ) = ∃R ¬φ
4. ∃R (φ ⊔ ψ) = ∃R φ ⊔ ∃R ψ
Proof of 1:
[[¬(∃R φ)]] = D \ [[∃R φ]] = D \ {x ∈ D | ∃y (⟨x, y⟩ ∈ [[R]] and y ∈ [[φ]])}
= {x ∈ D | not ∃y (⟨x, y⟩ ∈ [[R]] and y ∈ [[φ]])}
= {x ∈ D | ∀y if ⟨x, y⟩ ∈ [[R]] then y ∉ [[φ]]}
= {x ∈ D | ∀y if ⟨x, y⟩ ∈ [[R]] then y ∈ (D \ [[φ]])}
= {x ∈ D | ∀y if ⟨x, y⟩ ∈ [[R]] then y ∈ [[¬φ]]}
= [[∀R ¬φ]]
The form of the identities (interchanging quantification with connectives) is reminiscent of identities
in PL1; this is no coincidence, as the “semantics by translation” of Definition 18.3.22 shows.
Use the ALC identities as rewrite rules to compute it. (in linear time)
example                                       by rule
¬(∃R (∀S e ⊓ ¬(∀S d)))
  ↦ ∀R ¬(∀S e ⊓ ¬(∀S d))                      ¬(∃R φ) ↦ ∀R ¬φ
  ↦ ∀R (¬(∀S e) ⊔ ¬¬(∀S d))                   ¬(φ ⊓ ψ) ↦ ¬φ ⊔ ¬ψ
  ↦ ∀R (∃S ¬e ⊔ ¬¬(∀S d))                     ¬(∀S φ) ↦ ∃S ¬φ
  ↦ ∀R (∃S ¬e ⊔ ∀S d)                         ¬¬φ ↦ φ
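The rewrite system above is easy to turn into a program; here is a Python sketch that computes the negation normal form of an ALC concept in one linear pass. Concepts are nested tuples, the example at the end mirrors the (reconstructed) derivation above, and all names are assumptions of this sketch.

# Negation normal form for ALC concepts, pushing ¬ inward with the identities.
# Concepts: ("not", C), ("and", C, D), ("or", C, D),
#           ("forall", R, C), ("exists", R, C), or an atomic concept name.
def nnf(c):
    if isinstance(c, str):
        return c
    op = c[0]
    if op in ("and", "or"):
        return (op, nnf(c[1]), nnf(c[2]))
    if op in ("forall", "exists"):
        return (op, c[1], nnf(c[2]))
    d = c[1]                                         # c = ("not", d)
    if isinstance(d, str):
        return ("not", d)                            # negated atom: already NNF
    if d[0] == "not":
        return nnf(d[1])                             # ¬¬φ ↦ φ
    if d[0] == "and":
        return ("or", nnf(("not", d[1])), nnf(("not", d[2])))
    if d[0] == "or":
        return ("and", nnf(("not", d[1])), nnf(("not", d[2])))
    if d[0] == "exists":                             # ¬∃R φ ↦ ∀R ¬φ
        return ("forall", d[1], nnf(("not", d[2])))
    return ("exists", d[1], nnf(("not", d[2])))      # ¬∀R φ ↦ ∃R ¬φ

print(nnf(("not", ("exists", "R", ("and", ("forall", "S", "e"),
                                          ("not", ("forall", "S", "d")))))))
# ('forall', 'R', ('or', ('exists', 'S', ('not', 'e')), ('forall', 'S', 'd')))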
Finally, we extend ALC with an ABox component. This mainly means that we define two new
assertions in ALC and specify their semantics and PL1 translation.
If we take stock of what we have developed so far, then we see that ALC is a rational reconstruction
of semantic networks restricted to the “isa” and “instance” relations – which are the only
ones that can really be given a denotational and operational semantics.
where φ is a normalized ALC concept in negation normal form, with the following rules:
T⊥: from x:c and x:¬c derive ⊥
T⊓: from x:φ ⊓ ψ derive x:φ and x:ψ
T⊔: from x:φ ⊔ ψ branch into x:φ or x:ψ
T∀: from x:∀R φ and x R y derive y:φ
T∃: from x:∃R φ derive x R y and y:φ for a new individual y
In contrast to the tableau calculi for theorem proving we have studied earlier, TALC is run in “model
generation mode”. Instead of initializing the tableau with the axioms and the negated conjecture
and hoping that all branches will close, we initialize the TALC tableau with axioms and the “conjecture”
that a given concept φ is satisfiable – i.e. that φ has a member x – and hope for branches that are
open, i.e. that make the conjecture true (and at the same time give a model).
Let us now work through two very simple examples; one unsatisfiable, and a satisfiable one.
TALC Examples
Example 18.3.29. We have two similar conjectures about children.
x : ∀has_child man ⊓ ∃has_child ¬man   (all sons, but a daughter)
x : ∀has_child man ⊓ ∃has_child man    (only sons, and at least one)
Example 18.3.30 (Tableau Proof).
Left conjecture (inconsistent):
1  x:∀has_child man ⊓ ∃has_child ¬man   initial
2  x:∀has_child man                     T⊓
3  x:∃has_child ¬man                    T⊓
4  x has_child y                        T∃
5  y:¬man                               T∃
6  y:man                                T∀
7  ⊥                                    T⊥
Right conjecture (open):
1  x:∀has_child man ⊓ ∃has_child man    initial
2  x:∀has_child man                     T⊓
3  x:∃has_child man                     T⊓
4  x has_child y                        T∃
5  y:man                                T∃
   open
The right tableau has a model: there are two persons, x and y; y is the only child of x, and y is a man.
Another example: this one is more complex, but the concept is satisfiable.
7 y:ugrad y:grad T⊔
8 ⊥ open
The left branch is closed, the right one represents a model: y is a child of x, y
is a graduate student, x hat exactly one child: y.
After we got an intuition about TALC , we can now study the properties of the calculus to determine
that it is a decision procedure for ALC.
The correctness result for TALC is as usual: we start with a model of x:φ and show that a TALC
tableau must have an open branch.
Correctness
Lemma 18.3.32. If φ is satisfiable, then TALC terminates on x:φ with an open branch.
Proof: Let M := ⟨D, [[·]]⟩ be a model for φ and w ∈ [[φ]].
1. We define [[x]] := w and extend satisfaction to judgements: I |= (x:ψ) iff [[x]] ∈ [[ψ]],
   I |= x R y iff ⟨[[x]], [[y]]⟩ ∈ [[R]], and I |= S iff I |= c for all c ∈ S.
2. This gives us M |= (x:φ). (base case)
3. If the branch is satisfiable, then either no rule is applicable to the leaf (open branch),
   or a rule is applicable and one of the new branches is satisfiable. (inductive case)
4. There must be an open branch. (by termination)
We complete the proof by looking at all the TALC inference rules in turn.
For the completeness result for TALC we have to start with an open tableau branch and construct a
model that satisfies all judgements in the branch. We proceed by building a Herbrand model, whose
domain consists of all the individuals mentioned in the branch and which interprets all concepts
and roles as specified in the branch. Not surprisingly, the model thus constructed satisfies the
branch.
D := {x | x:ψ ∈ B or z R x ∈ B}
[[c]] := {x | x:c ∈ B}
[[R]] := {⟨x, y⟩ | x R y ∈ B}
We complete the proof by looking at all the TALC inference rules in turn.
case y:ψ = y:(ψ1 ⊓ ψ2): Then {y:ψ1, y:ψ2} ⊆ B (T⊓ rule, saturation), so M |= (y:ψ1)
and M |= (y:ψ2), and thus M |= (y:ψ1 ⊓ ψ2). (IH, Definition)
case y:ψ = y:(ψ1 ⊔ ψ2): Then y:ψ1 ∈ B or y:ψ2 ∈ B (T⊔, saturation), so M |= (y:ψ1) or
M |= (y:ψ2), and thus M |= (y:ψ1 ⊔ ψ2). (IH, Definition)
case y:ψ = y:∃R θ: Then {y R z, z:θ} ⊆ B for a new variable z (T∃ rule, saturation), so
M |= (z:θ) and M |= y R z, thus M |= (y:∃R θ). (IH, Definition)
case y:ψ = y:∀R θ: Let ⟨[[y]], v⟩ ∈ [[R]] for some v ∈ D,
then v = z for some variable z with y R z ∈ B (construction of [[R]]). So z:θ ∈ B and
M |= (z:θ) (T∀ rule, saturation, Def). Since v was arbitrary, we have M |= (y:∀R θ).
Termination
Theorem 18.3.34. TALC terminates
To prove termination of a tableau algorithm, find a well-founded measure (function)
We can turn the termination result into a worst-case complexity result by examining the sizes of
branches.
Complexity
Idea: Work off tableau branches one after the other. (branch size =b space complexity)
Observation 18.3.35. The size of the branches is polynomial in the size of the
input formula:
Proof sketch: Re-examine the termination proof and count: the first summand
comes from Proof step 4., the second one from Proof step 3. and Proof step 2.
Theorem 18.3.36. The satisfiability problem for ALC is in PSPACE.
Theorem 18.3.37. The satisfiability problem for ALC is PSPACE-Complete.
Proof sketch: Reduce a PSPACE-complete problem to ALC-satisfiability
In summary, the theoretical complexity of ALC is the same as that for PL0 , but in practice ALC is
much more expressive. So this is a clear win.
But the description of the tableau algorithm TALC is still quite abstract, so we look at an
exemplary implementation in a functional programming language.
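The functional implementation from the course slides is not reproduced in these notes; as a substitute, here is a Python sketch of TALC in model generation mode for an empty TBox: it saturates a set of judgements with the rules above and reports whether an open (clash-free) branch exists. The concept representation follows the NNF sketch above, and all other names are our own.

# Satisfiability of an ALC concept (in negation normal form, empty TBox)
# by saturating tableau branches.  A branch is a set of judgements
# ("in", x, concept) and ("role", x, R, y).
import itertools

def consistent(branch, counter=None):
    counter = counter or itertools.count(1)
    branch = set(branch)
    while True:
        new = set()
        for j in list(branch):
            if j[0] != "in":
                continue
            _, x, c = j
            if isinstance(c, str):
                if ("in", x, ("not", c)) in branch:
                    return False                      # T_bot: clash, branch closes
            elif c[0] == "and":                       # T_and
                new |= {("in", x, c[1]), ("in", x, c[2])}
            elif c[0] == "or":                        # T_or: explore both branches
                if not any(("in", x, d) in branch for d in c[1:]):
                    return any(consistent(branch | {("in", x, d)}, counter)
                               for d in c[1:])
            elif c[0] == "forall":                    # T_forall
                new |= {("in", r[3], c[2]) for r in branch
                        if r[0] == "role" and r[1:3] == (x, c[1])}
            elif c[0] == "exists":                    # T_exists: fresh individual
                if not any(r[0] == "role" and r[1:3] == (x, c[1])
                           and ("in", r[3], c[2]) in branch for r in branch):
                    y = "y" + str(next(counter))
                    new |= {("role", x, c[1], y), ("in", y, c[2])}
        if new <= branch:
            return True                               # saturated and open: a model
        branch |= new

# the two conjectures from Example 18.3.29:
print(consistent({("in", "x", ("and", ("forall", "has_child", "man"),
                                      ("exists", "has_child", ("not", "man"))))}))  # False
print(consistent({("in", "x", ("and", ("forall", "has_child", "man"),
                                      ("exists", "has_child", "man")))}))           # True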
Note that we have (so far) only considered an empty TBox: we have initialized the tableau
with a normalized concept; so we did not need to include the concept definitions. To cover “real”
ontologies, we need to consider the case of concept axioms as well.
We now extend TALC with concept axioms. The key idea here is to realize that the concept axioms
apply to all individuals. As the individuals are generated by the T∃ rule, we can simply extend
that rule to apply all the concept axioms to the newly introduced individual.
The problem with this approach is that it spoils termination, since we can no longer bound the number of
rule applications by (fixed) properties of the input formulae. The example shows this very nicely.
x:d        start
x:∃R c     in CA
x R y1     T∃
y1:c       T∃
y1:∃R c    T∃^CA
y1 R y2    T∃
y2:c       T∃
y2:∃R c    T∃^CA
...
Solution: Loop check: instead of a new variable y take an old variable z, if we can guarantee
that whatever holds for y already holds for z. We can only do this iff the T∀ rule has been
exhaustively applied.
Theorem 18.3.40. The consistency problem of ALC with concept axioms is decid-
able.
Proof sketch: TALC with a suitable loop check terminates.
If we combine classification with the instance test, then we get the full picture of how concepts
and individuals relate to each other. We see that we get the full expressivity of semantic networks
in ALC.
Realization
Definition 18.3.43. Realization is the computation of all instance relations be-
tween ABox objects and TBox concepts.
Let us now get an intuition for the kinds of interactions that can occur between the various parts of an ontology.
property                                          example
internally inconsistent ABox                      tony:student, tony:¬student
inconsistent with a TBox                          TBox: ¬(student ⊓ prof); ABox: tony:student, tony:prof
implicit info that is not explicit                ABox: tony:∀has_grad genius, tony has_grad mary  |=  mary:genius
information that can be combined with TBox info   TBox: happy_prof = prof ⊓ ∀has_grad genius; ABox: tony:happy_prof, tony has_grad mary  |=  mary:genius
This completes our investigation of inference for ALC. We summarize that ALC is a logic-based
ontology language where the inference problems are all decidable/computable via TALC . But of
course, while we have reached the expressivity of basic semantic networks, there are still things
that we cannot express in ALC, so we will try to extend ALC without losing decidability/com-
putability.
Note that all these examples have in common that they are about “objects on the Web”, which is
an aspect we will come to now.
“Objects on the Web” are traditionally called “resources”, rather than defining them by their
intrinsic properties – which would be ambitious and prone to change – we take an external property
to define them: everything that has a URI is a web resource. This has repercussions on the design
of RDF.
The crucial observation here is that if we map “subjects” and “objects” to “individuals”, and
“predicates” to “relations”, the RDF triples are just relational ABox statements of description
logics. As a consequence, the techniques we developed apply.
Note: Actually, an RDF graph is technically a labeled multigraph, which allows multiple edges between
any two nodes (the resources) and where nodes and edges are labeled by URIs.
We now come to the concrete syntax of RDF. This is a relatively conventional XML syntax that
combines RDF statements with a common subject into a single “description” of that resource.
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22−rdf−syntax−ns#"
xmlns:dc= "http://purl.org/dc/elements/1.1/">
<rdf:Description about="https://.../CompLog/kr/en/rdf.tex">
<dc:creator>Michael Kohlhase</dc:creator>
<dc:source>http://www.w3schools.com/rdf</dc:source>
</rdf:Description>
</rdf:RDF>
Note that XML namespaces play a crucial role in using elements to encode the predicate URIs.
Recall that an element name is a qualified name that consists of a namespace URI and a proper
element name (without a colon character). Concatenating them gives a URI; in our example the
predicate URI induced by the dc:creator element is http://purl.org/dc/elements/1.1/creator.
Note that as URIs go RDF URIs do not have to be URLs, but this one is and it references (is
redirected to) the relevant part of the Dublin Core elements specification [DCM12].
RDF was deliberately designed as a standoff markup format, where URIs are used to annotate
web resources by pointing to them, so that it can be used to give information about web resources
without having to change them. But this also creates maintenance problems, since web resources
may change or be deleted without warning.
RDFa gives authors a way to embed RDF triples into web resources themselves, which makes it
easier to keep the RDF statements about them in sync.
[RDF triples encoded in the RDFa example: the resource https://svn.kwarc.info/.../CompLog/kr/slides/rdfa.tex has http://purl.org/dc/elements/1.1/title “RDFa as an Inline RDF Markup Format”, http://purl.org/dc/elements/1.1/date 2009−11−11 (xsd:date), and http://purl.org/dc/elements/1.1/creator Michael Kohlhase]
In the example above, the about and property attributes are reserved by RDFa and specify the
subject and predicate of the RDF statement. The object consists of the body of the element,
unless otherwise specified e.g. by the resource attribute.
Let us now come back to the fact that RDF is just an XML syntax for ABox statements.
In this situation, we want a standardized representation language for TBox information; OWL
does just that: it standardizes a set of knowledge representation primitives and specifies a variety
of concrete syntaxes for them. OWL is designed to be compatible with RDF, so that the two
together can form an ontology language for the web.
Definition 18.4.10. OWL (the Web Ontology Language) is a language for encoding
TBox information about RDF classes.
Example 18.4.11 (A concept definition for “Mother”).
Mother=Woman ⊓ Parent is represented as
But there are also other syntaxes in regular use. We show the functional syntax which is inspired
by the mathematical notation of relations.
We have introduced the ideas behind using description logics as the basis of a “machine-oriented
web of data”. While the first OWL specification (2004) had three sublanguages “OWL Lite”, “OWL
DL” and “OWL Full”, of which only the middle was based on description logics, with the OWL2
Recommendation from 2009, the foundation in description logics was nearly universally accepted.
The semantic web hype is by now nearly over; the technology has reached the “plateau of
productivity” with many applications being pursued in academia and industry. We will not go
into these, but briefly introduce one of the tools that make this work.
Example 18.4.15.
Query for person names and their e-mails from a triplestore with FOAF data.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
?person a foaf:Person.
?person foaf:name ?name.
?person foaf:mbox ?email.
}
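Triplestores and RDF libraries expose SPARQL directly. As an illustration, the following Python sketch runs the query above over a local RDF file with the rdflib library; the file name friends.ttl and its FOAF content are assumptions of this sketch, not part of the course material.

# Running the SPARQL query above in-memory with rdflib (pip install rdflib).
from rdflib import Graph

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
  ?person foaf:mbox ?email .
}
"""

g = Graph()
g.parse("friends.ttl", format="turtle")   # load ABox triples (assumed local file)
for row in g.query(query):                # evaluate the SPARQL query
    print(row.name, row.email)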
SPARQL endpoints can be used to build interesting applications, if fed with the appropriate data.
An interesting – and by now paradigmatic – example is the DBpedia project, which builds a large
ontology by analyzing Wikipedia fact boxes. These are in a standard HTML form which can be
analyzed e.g. by regular expressions, and their entries are essentially already in triple form: the
subject is the Wikipedia page they are on, the predicate is the key, and the object is either the
URI of the object value (if it carries a link) or the value itself.
We conclude our survey of the semantic web technology stack with the notion of a triplestore,
the database component that stores vast collections of ABox triples.
This part covers the AI subfield of “planning”, i.e. search-based problem solving with a
structured representation language for environment state and actions — in planning, the focus is
on the latter.
We first introduce the framework of planning (structured representation languages for problems
and actions) and then present algorithms and complexity results. Finally, we lift some of the
simplifying assumptions – deterministic, fully observable environments – we made in the previous
parts of the course.
Chapter 19
Planning I: Framework
Planning
Ambition: Write one program that can solve all classical search problems.
Idea: For CSP, going from “state/action-level search” to “problem-description
level search” did the trick.
Definition 19.0.2. Let Π be a search problem (see chapter 8)
Use inference systems to deduce new world knowledge from percepts and actions.
Problem: Representing (changing) percepts immediately leads to contradictions!
Example 19.1.1. The agent moves, and a cell with a draft (a perceived breeze)
is followed by one without.
Let us recall the agent-based setting we were using for the inference procedures from Part III. We
will elaborate this further in this section.
[figure: a model-based reflex agent (cf. AIMA Figure 2.12) – sensors tell the agent what the world is like now, an internal state together with models of how the world evolves and what the agent’s actions do is used to choose an action, which the actuators execute in the environment]
Still Unspecified: (up next)
MAKE−PERCEPT−SENTENCE: the effects of percepts.
MAKE−ACTION−QUERY: what is the best next action?
MAKE−ACTION−SENTENCE: the effects of that action.
In particular, we will look at the effect of time/change. (neglected so far)
Now that we have the notion of fluents to represent the percepts at a given time point, let us try
to model how they influence the agent’s world model.
Axioms like these model the agent’s sensors – here that they are totally reliable:
there is a breeze, iff the agent feels a draft.
Definition 19.1.5. We call fluents that describe the agent’s sensors sensor axioms.
Problem: Where do fluents like Ag@(t, x, y) come from?
You may have noticed that for the sensor axioms we have only used first-order logic. There is a
general story to tell here: if we have finite domains (as we do in the Wumpus cave) we can always
“compile first-order logic” into propositional logic. We will develop this here before we go on with
the Wumpus models.
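A small Python sketch makes the “compilation” concrete: for a finite domain we can simply replace every quantifier by a finite conjunction or disjunction over all domain elements. The formula representation and the example axiom are assumptions of this sketch.

# Grounding first-order formulae over a finite domain: ∀x φ(x) becomes the
# conjunction of all instances, ∃x φ(x) the disjunction.  Formulae are
# nested tuples; anything that is not a quantifier or connective is an atom.
def ground(phi, domain):
    op = phi[0]
    if op in ("forall", "exists"):
        _, var, body = phi
        conn = "and" if op == "forall" else "or"
        return (conn,) + tuple(ground(substitute(body, var, d), domain)
                               for d in domain)
    if op in ("and", "or", "not", "implies"):
        return (op,) + tuple(ground(p, domain) for p in phi[1:])
    return phi                                   # ground atom

def substitute(phi, var, value):
    if phi == var:
        return value
    if isinstance(phi, tuple):
        return tuple(substitute(p, var, value) for p in phi)
    return phi

# e.g. a sensor axiom over the cells of a tiny 2x2 cave (coordinates assumed):
cells = [(1, 1), (1, 2), (2, 1), (2, 2)]
axiom = ("forall", "c", ("implies", ("Draft", "c"), ("Breeze", "c")))
print(ground(axiom, cells))   # a conjunction of four propositional implications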
We now continue to our logic-based agent models: Now we focus on effect axioms to model the
effects of an agent’s actions.
Definition 19.1.7. Effect axioms describe how the environment change under an
agent’s actions.
Example 19.1.8. If the agent is in cell [1, 1] facing east at time 0 and goes
forward, she is in cell [2, 1] and no longer in [1, 1]:
Unfortunately, the percept fluents, sensor axioms, and effect axioms are not enough, as we will
show in ??. We will see that this is an instance of a more general problem – the famous frame problem.
Note that OK and glitter are fluents, since the Wumpus might have died or the gold
might have been grabbed.
And finally the route planning part of the code. This is essentially just A∗ search.
Evaluation: Even though this works for the Wumpus world, it is not the “universal,
logic-based problem solver” we dreamed of!
Planning tries to solve this with another representation of actions. (up next)
[example workflow diagram: processing a CQ – create CQ, submit CQ, check CQ completeness and consistency, decide whether approval is necessary, check CQ approval status, mark CQ as accepted, create follow-up for CQ, archive CQ]
[figures: a network security scenario used as a planning domain – an attacker on the Internet, a router and firewall, a DMZ containing a web server and an application server, and an internal network with a workstation, a DB server, and sensitive users]
Quick: Rapid prototyping: 10s lines of problem description vs. 1000s lines of C++
code. (E.g. language generation)
Flexible: Adapt/maintain the description. (E.g. network security)
Intelligent: Determines automatically how to solve a complex problem effectively!
(The ultimate goal, no?!)
Efficiency loss: Without any domain-specific knowledge about chess, you don’t
beat Kasparov . . .
Trade-off between “automatic and general” vs. “manual work but effective”.
Research Question: How to make fully automatic algorithms effective?
Search Planning
States Lisp data structures Logical sentences
Actions Lisp code Preconditions/outcomes
Goal Lisp code Logical sentence (conjunction)
Plan Sequence from S0 Constraints on actions
n blocks, 1 hand.
A single action either takes a block with the hand or puts a
block we’re holding onto some other block/the table.
Observation 19.2.4. State spaces typically are huge even for simple problems.
In other words: Even solving “simple problems” automatically (without help from
a human) requires a form of intelligence.
With blind search, even the largest super computer in the world won’t scale beyond
20 blocks!
Focussing on heuristic search as the solution method, this is the main question
that needs to be answered.
SAT variables: at(A)0, at(B)0, move(A, B)0, move(A, C)0, at(A)1, at(B)1; clauses to encode the
transition behavior, e.g. at(B)1^F ∨ move(A, B)0^T; unit clauses to encode the initial state:
at(A)0^T, at(B)0^F; a unit clause to encode the goal: at(B)1^T.
Popular when: 1996 – today.
Approach: From planning task description, generate propositional CNF formula
φk that is satisfiable iff there exists a plan with k steps; use a SAT solver on φk ,
for different values of k.
Keywords/cites: [KS92; KS98; RHN06; Rin10], SAT encoding schemes, Black-
Box, . . .
Prerequisite/Result:
Standard representation language: PDDL [McD+98; FL03; HE05; Ger+09]
Problem Corpus: ≈ 50 domains, ≫ 1000 instances, 74 (!!) planners in 2011
y?
Answer: reserved for the plenary sessions ; be there!
Generally: reserved for the plenary sessions ; be there!
[figure: initial state with C on A and B on the table beside it; goal state with A on B and B on C]
Simple planners that split the goal into subgoals on(A, B) and on(B, C) fail:
STRIPS Planning
Definition 19.4.1. STRIPS = Stanford Research Institute Problem Solver.
STRIPS is the simplest possible (reasonably expressive) logic-based planning
language.
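The formal definition of STRIPS tasks follows below; as a preview, here is a Python sketch of the representation (states and goals as sets of facts, actions as precondition/add/delete triples) together with breadth-first forward search. The concrete facts and action names form our own toy instance in the spirit of the Australia example below, not the course's exact encoding.

# A STRIPS task as (initial state, goal, actions) and blind forward search.
from collections import deque

def forward_search(init, goal, actions):
    """Breadth-first search in the space of states; returns a plan or None."""
    start = frozenset(init)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                      # all goal facts hold
            return plan
        for name, (pre, add, delete) in actions.items():
            if pre <= state:                   # action applicable
                succ = frozenset((state - delete) | add)
                if succ not in seen:
                    seen.add(succ)
                    frontier.append((succ, plan + [name]))
    return None

actions = {
    "drv(Sy,Ad)": ({"at(Sy)"}, {"at(Ad)", "vis(Ad)"}, {"at(Sy)"}),
    "drv(Ad,Sy)": ({"at(Ad)"}, {"at(Sy)", "vis(Sy)"}, {"at(Ad)"}),
    "drv(Sy,Br)": ({"at(Sy)"}, {"at(Br)", "vis(Br)"}, {"at(Sy)"}),
}
print(forward_search({"at(Sy)", "vis(Sy)"}, {"vis(Ad)", "vis(Br)"}, actions))
# ['drv(Sy,Ad)', 'drv(Ad,Sy)', 'drv(Sy,Br)']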
“TSP” in Australia
Example 19.4.3 (Salesman Travelling in Australia).
[figure: the state space induced by the STRIPS encoding of the task, with facts at(·) and vis(·) for Sydney (Sy), Adelaide (Ad), and Brisbane (Br) and drive actions drv(·, ·) between the states]
Answer: Yes, two – plans for TSP− are solutions for ΘTSP−, dashed node =b I, thick nodes =b G:
drv(Sy, Br), drv(Br, Sy), drv(Sy, Ad) (upper path)
drv(Sy, Ad), drv(Ad, Sy), drv(Sy, Br). (lower path)
The Blocksworld
Definition 19.4.7. The blocks world is a simple planning domain: a set of wooden
blocks of various shapes and colors sit on a table. The goal is to build one or more
vertical stacks of blocks. Only one block may be moved at a time: it may either be
placed on the table or placed atop another block.
Example 19.4.8.
[figure: an example blocks world state and goal configuration built from the blocks A–E]
The next example for a planning problem is not obvious at first sight, but has been quite influential,
showing that many industry problems can be specified declaratively by formalizing the domain
and the particular planning problems in PDDL and then using off-the-shelf planners to solve them.
[KS00] reports that this has significantly reduced labor costs and increased maintainability of the
implementation.
[figure: an elevator scheduling scenario with passenger types VIP, NA (never-alone), AT (attendant), and P (normal passenger)]
[figure (repeated): initial state with C on A and B on the table beside it; goal state with A on B and B on C]
Simple planners that split the goal into subgoals on(A, B) and on(B, C) fail:
Before we go into the details, let us try to understand the main ideas of partial order planning.
We now make the ideas discussed above concrete by giving a mathematical formulation. It is
advantageous to cast a partially ordered plan as a labeled DAG rather than a partial ordering
since it draws the attention to the difference between actions and steps.
A non-empty set p ⊆ P of facts that are effects of the action of S and preconditions of
that of T . We call such a labeled edge a causal link and write it S −p→ T .
An edge labeled with ≺ is called a temporal constraint and written as S ≺ T .
An open condition is a precondition of a step not yet causally linked.
Definition 19.5.4. Let Π be a partially ordered plan, then we call a step U possibly
intervening in a causal link S −p→ T , iff Π ∪ {S ≺ U , U ≺ T } is acyclic.
Definition 19.5.5. A precondition is achieved iff it is the effect of an earlier step
and no possibly intervening step undoes it.
Definition 19.5.6. A partially ordered plan Π is called complete iff every precon-
dition is achieved.
Definition 19.5.7. Partial order planning is the process of computing complete
and acyclic partially ordered plans for a given planning task.
Actions: Buy(x)
  Preconditions: At(p), Sells(p, x)
  Effects: Have(x)
Notation: A causal link S −p→ T can also be denoted by a direct arrow between the
effects p of S and the preconditions p of T in the STRIPS action notation above.
Temporal constraints are shown as dashed arrows.
Planning Process
Definition 19.5.9. Partial order planning is search in the space of partial plans via
the following operations:
add link from an existing action to an open precondition,
add step (an action with links to other steps) to fulfil an open condition,
order one step wrt. another to remove possible conflicts.
Idea: Gradually move from incomplete/vague plans to complete, correct plans.
Backtrack if an open condition is unachievable or if a conflict is unresolvable.
Definition 19.5.11. If C clobbers S −p→ T in a partially ordered plan Π, then we
can solve the induced conflict by
demotion: add a temporal constraint C ≺ S to Π, or
promotion: add T ≺ C to Π.
[figure: Go(Home) clobbers the causal link Go(SM) −At(SM)→ Buy(Milk); demotion =b put Go(Home) before Go(SM), promotion =b put it after Buy(Milk)]
Properties of POP
Nondeterministic algorithm: backtracks at choice points on failure:
choice of Sadd to achieve Sneed ,
choice of demotion or promotion for clobberer,
selection of Sneed is irrevocable.
Observation 19.5.15. POP is sound, complete, and systematic, i.e. there is no repetition of partial plans.
There are extensions for disjunction, universals, negation, and conditionals.
It can be made efficient with good heuristics derived from problem description.
Particularly good for problems with many loosely related subgoals.
[worked example (figure sequence): partial order planning on the blocks world problem above – the plan is initialized with the Start and Finish steps; refining for the subgoals On(A, B), On(B, C), Cl(B), and Cl(C) adds the steps Move(A, B), Move(B, C), and Move(C, T) with causal links; clobbering conflicts are resolved by promotion/demotion, and the resulting partial order plan linearizes to Move(C, T), Move(B, C), Move(A, B)]
19.7 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26900.
Summary
General problem solving attempts to develop solvers that perform well across a large
class of problems.
Suggested Reading:
• Chapters 10: Classical Planning and 11: Planning and Acting in the Real World in [RN09].
– Although the book is named “A Modern Approach”, the planning section was written long
before the IPC was even dreamt of, before PDDL was conceived, and several years before
heuristic search hit the scene. As such, what we have right now is the attempt of two outsiders
trying in vain to catch up with the dramatic changes in planning since 1995.
– Chapter 10 is Ok as a background read. Some issues are, imho, misrepresented, and it’s far
from being an up-to-date account. But it’s Ok to get some additional intuitions in words
different from my own.
– Chapter 11 is useful in our context here because we don’t cover any of it. If you’re interested
in extended/alternative planning paradigms, do read it.
• A good source for modern information (some of which we covered in the lecture) is Jörg
Hoffmann’s Everything You Always Wanted to Know About Planning (But Were Afraid to
Ask) [Hof11] which is available online at http://fai.cs.uni-saarland.de/hoffmann/papers/
ki11.pdf
Chapter 20
Planning II: Algorithms
20.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26901.
In planning, this is referred to as forward search, or forward state-space search.
[Figure: several paths from init to goal, each annotated with a cost estimate h.]
Heuristic function h estimates the cost of an optimal path from a state s to the
goal; search prefers to expand states s with small h(s).
Live Demo vs. Breadth-First Search:
http://qiao.github.io/PathFinding.js/visual/
Exactly like our definition from chapter 8, except that, because we assume unit costs here, we use N instead of R+ .
Definition 20.1.2. Let Π be a STRIPS task with states S. The perfect heuristic
h∗ assigns every s∈S the length of a shortest path from s to a goal state, or ∞ if
no such path exists. A heuristic function h for Π is admissible if, for all s∈S, we
have h(s)≤h∗ (s).
Exactly like our definition from chapter 8, except for path length instead of path
cost (cf. above).
In all cases, we attempt to approximate h∗ (s), the length of an optimal plan for s.
Some algorithms guarantee to lower bound h∗ (s).
The delete relaxation is the most successful method for the automatic generation
of heuristic functions. It is a key ingredient to almost all IPC winners of the last
decade. It relaxes STRIPS tasks by ignoring the delete lists.
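As a minimal illustration of the relaxation itself, here is a Python sketch. The representation (an action as a triple of precondition, add, and delete sets, facts as strings) is an assumption made for this sketch, not the notation used elsewhere in the notes.

def delete_relax(action):
    """Delete relaxation: keep preconditions and add list, drop the delete list."""
    pre, add, dele = action
    return (pre, add, set())

def apply_relaxed(state, action):
    """Applying a relaxed action only ever adds facts, so states grow monotonically."""
    pre, add, _ = action
    if not pre <= state:
        raise ValueError("preconditions not satisfied")
    return state | add

# Example: dr(A,B) from the route-finding example (illustrative encoding).
drAB = ({"truck(A)"}, {"truck(B)"}, {"truck(A)"})
print(apply_relaxed({"truck(A)", "pack(C)"}, delete_relax(drAB)))
# -> {'truck(A)', 'truck(B)', 'pack(C)'}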
The h+ Heuristic: What is the resulting heuristic function?
Relaxation in Route-Finding
We will start with a very simple relaxation, which could be termed “positive thinking”: we do not
consider preconditions of actions and leave out the delete lists as well.
[Diagram: the relaxation mapping R maps the problem Πs to R(Πs ) in P ′ , on which h∗ is computed.]
Real problem:
Initial state I: AC; goal G: AD.
Actions A: pre, add, del.
drXY, loX, ulX.
Relaxed problem:
State s: AC; goal G: AD.
Actions A: add.
hR (s) =1: ⟨ulD⟩.
Real problem:
State s: BC; goal G: AD.
Actions A: pre, add, del.
AC −drAB→ BC.
Relaxed problem:
State s: BC; goal G: AD.
Actions A: add.
hR (s) =2: ⟨drBA, ulD⟩.
Real problem:
State s: CC; goal G: AD.
Actions A: pre, add, del.
BC −drBC→ CC.
Relaxed problem:
State s: CC; goal G: AD.
Actions A: add.
hR (s) =2: ⟨drBA, ulD⟩.
Real problem:
State s: AC; goal G: AD.
Actions A: pre, add, del.
BC −drBA→ AC.
Real problem:
State s: AC; goal G: AD.
Actions A: pre, add, del.
Duplicate state, prune.
Real problem:
State s: DC; goal G: AD.
Actions A: pre, add, del.
CC −drCD→ DC.
Relaxed problem:
State s: DC; goal G: AD.
Actions A: add.
hR (s) =2: ⟨drBA, ulD⟩.
Real problem:
State s: CT ; goal G: AD.
Actions A: pre, add, del.
CC −loC→ CT .
Relaxed problem:
State s: CT ; goal G: AD.
Actions A: add.
hR (s) =2: ⟨drBA, ulD⟩.
Real problem:
State s: BC; goal G: AD.
Actions A: pre, add, del.
CC −drCB→ BC.
[Figure: the search space of the route-finding example under greedy best-first search with hR ; the states AC, BC, CC, CT, DT, . . . are annotated with their hR values (1 and 2), and "We are here" marks the current search node.]
[Diagram: how heuristic functions h : P → N ∪ {∞} are obtained: a task in P is mapped by the relaxation mapping R into the class P ′ ⊆ P of simpler tasks, where h∗ can be computed.]
“When the world changes, its previous state remains true as well.”
[Figure: real world vs. relaxed world, before and after an action: in the relaxed world, the previous state remains true as well.]
In other words, the class of simpler problems P ′ is the set of all STRIPS tasks with
empty delete lists, and the relaxation mapping R drops the delete lists.
Definition 20.3.2 (Relaxed Plan). Let Π:=⟨P , A, I , G⟩ be a STRIPS task, and
let s be a state. A relaxed plan for s is a plan for ⟨P , A, s, G⟩+ . A relaxed plan for
I is called a relaxed plan for Π.
A relaxed plan for s is an action sequence that solves s when pretending that all
delete lists are empty.
Also called “delete-relaxed plan”: “relaxation” is often used to mean “delete-relaxation”
by default.
load(x)+ : “truck(x), pack(x) ⇒ pack(T )”.
unload(x)+ : “truck(x), pack(T ) ⇒ pack(x)”.
Relaxed plan:
⟨drive(A, B)+ , drive(B, C)+ , load(C)+ , drive(C, D)+ , unload(D)+ ⟩
We don’t need to drive the truck back, because “it is still at A”.
PlanEx+
Definition 20.3.3 (Relaxed Plan Existence Problem). By PlanEx+ , we denote
the problem of deciding, given a STRIPS task Π:=⟨P , A, I , G⟩, whether or not there
exists a relaxed plan for Π.
Iterations on F :
1. {at(Sy), vis(Sy)}
2. ∪ {at(Ad), vis(Ad), at(Br), vis(Br)}
3. ∪ {at(Da), vis(Da), at(Pe), vis(Pe)}
Iterations on F :
1. {truck(A), pack(C)}
2. ∪{truck(B)}
3. ∪{truck(C)}
4. ∪{truck(D), pack(T )}
Iterations on F :
1. {truck(A), pack(C)}
2. ∪{truck(B)}
3. ∪{truck(C)}
4. ∪{pack(T )}
5. ∪{pack(A), pack(B)}
6. ∪∅
5.2. Assume, for the moment, that we drop line (*) from the algorithm. It is then easy to see that ai ∈Ai and apply(I, ⟨a0 + , . . ., ai−1 + ⟩) ⊆ Fi , for all i.
5.3. We get G ⊆ apply(I, ⟨a0 + , . . ., an−1 + ⟩) ⊆ Fn , and the algorithm returns “solvable” as desired.
5.4. Assume to the contrary of the claim that, in an iteration i < n, (*) fires. Then G ̸⊆ F and F = F ′ . But, with F = F ′ , F = Fj for all j > i, and we get G ̸⊆ Fn in contradiction.
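The iterations on F above come from a fixpoint computation. A minimal Python sketch of it follows (using the same set-based action encoding as in the earlier sketch; this is not verbatim the algorithm with line (*) referred to in the proof, only an illustration of the idea).

def relaxed_plan_exists(facts_init, actions, goal):
    """Decide PlanEx+: grow F by applying all relaxed actions whose preconditions
    are already in F, until the goal is contained in F or F stabilizes."""
    F = set(facts_init)
    while True:
        if goal <= F:
            return True
        new_F = set(F)
        for pre, add, _ in actions:          # delete lists are ignored
            if pre <= F:
                new_F |= add
        if new_F == F:                       # no progress and goal not reached: line (*) fires
            return False
        F = new_F

# Logistics illustration matching the iterations listed above (hypothetical encoding):
actions = [
    ({"truck(A)"}, {"truck(B)"}, {"truck(A)"}),           # drive(A,B)
    ({"truck(B)"}, {"truck(C)"}, {"truck(B)"}),           # drive(B,C)
    ({"truck(C)"}, {"truck(D)"}, {"truck(C)"}),           # drive(C,D)
    ({"truck(C)", "pack(C)"}, {"pack(T)"}, {"pack(C)"}),  # load(C)
    ({"truck(D)", "pack(T)"}, {"pack(D)"}, {"pack(T)"}),  # unload(D)
]
print(relaxed_plan_exists({"truck(A)", "pack(C)"}, actions, {"pack(D)"}))  # True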
h+ is Admissible
Lemma 20.4.3. Let Π:=⟨P , A, I , G⟩ be a STRIPS task, and let s be a state. If ⟨a1 , . . ., an ⟩ is a plan for Πs :=⟨P , A, {s}, G⟩, then ⟨a1 + , . . ., an + ⟩ is a plan for Πs + .
If we ignore deletes, the states along the plan can only get bigger.
Theorem 20.4.4. h+ is Admissible.
Proof:
1. Let Π:=⟨P , A, I , G⟩ be a STRIPS task with state set S, and let s∈S.
2. h+ (s) is defined as the optimal plan length in Πs + .
3. With the lemma above, any plan for Πs also constitutes a plan for Πs + .
4. Thus the optimal plan length in Πs + can only be shorter than that in Πs , and the claim follows.
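Computing h+ exactly means finding an optimal relaxed plan, which is NP-hard in general, so it is only feasible for tiny tasks. A brute-force sketch via breadth-first search over fact sets, using the same illustrative encoding as above:

from collections import deque

def h_plus(state, actions, goal):
    """Length of an optimal relaxed plan from `state`, or float('inf') if none exists.
    BFS over sets of facts; relaxed action application only adds facts."""
    start = frozenset(state)
    if goal <= start:
        return 0
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        facts, depth = frontier.popleft()
        for pre, add, _ in actions:          # ignore delete lists
            if pre <= facts:
                nxt = frozenset(facts | add)
                if goal <= nxt:
                    return depth + 1
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return float("inf")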
[Slide animation: greedy best-first search on the route-finding example, this time with h+ : for each generated state the relaxed problem (actions with preconditions and add lists, but no delete lists) is solved; e.g. BC −drBA→ AC regenerates the initial state AC.]
[Figure: the search space of the route-finding example annotated with h+ values; along the solution path AC, BC, CC, CT, DT, DD, CD, BD, AD the values decrease 5, 5, 5, 4, 4, 3, 2, 1, 0, and "We are here" marks the current search node.]
h+ in the Blocksworld
[Figure: initial and goal configurations of a blocksworld instance with blocks A, B, C, D.]
Optimal plan: ⟨putdown(A), unstack(B, D), stack(B, C), pickup(A), stack(A, B)⟩.
Optimal relaxed plan: ⟨stack(A, B), unstack(B, D), stack(B, C)⟩.
Observation: What can we say about the “search space surface” at the initial
state here?
The initial state lies on a local minimum under h+ , together with the successor state s where we stacked A onto B. All other direct neighbours of these two states have a strictly higher h+ value.
20.5 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26906.
Summary
Heuristic search on classical search problems relies on a function h mapping states
s to an estimate h(s) of their goal distance. Such functions h are derived by solving
relaxed problems.
In planning, the relaxed problems are generated and solved automatically. There are four known families of suitable relaxation methods: abstractions, landmarks, delete relaxation, and critical paths.
Suggested Reading:
• Chapters 10: Classical Planning and 11: Planning and Acting in the Real World in [RN09].
– Although the book is named “A Modern Approach”, the planning section was written long
before the IPC was even dreamt of, before PDDL was conceived, and several years before
heuristic search hit the scene. As such, what we have right now is the attempt of two outsiders
trying in vain to catch up with the dramatic changes in planning since 1995.
– Chapter 10 is Ok as a background read. Some issues are, imho, misrepresented, and it’s far
from being an up-to-date account. But it’s Ok to get some additional intuitions in words
different from my own.
– Chapter 11 is useful in our context here because we don’t cover any of it. If you’re interested
in extended/alternative planning paradigms, do read it.
• A good source for modern information (some of which we covered in the lecture) is Jörg
Hoffmann’s Everything You Always Wanted to Know About Planning (But Were Afraid to
Ask) [Hof11] which is available online at http://fai.cs.uni-saarland.de/hoffmann/papers/
ki11.pdf
Chapter 21
Searching, Planning, and Acting in the Real World
Outline
So Far: we made idealizing/simplifying assumptions:
The environment is fully observable and deterministic.
21.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26908.
Definition 21.1.4. The qualification problem in planning is that we can never finish
listing all the required preconditions and possible conditional effects of actions.
Root Cause: The environment is partially observable and/or non-deterministic.
Technical Problem: We cannot know the “current state of the world”, but search/planning algorithms are based on this assumption.
Actions:
remove lid from can
paint object with paint from open can.
We formalize the example in PDDL for simplicity. Note that the :percept scheme is not part of
the official PDDL, but fits in well with the design.
...)
The PDDL problem file has a “free” variable ?c for the (undetermined) joint
color.
(define (problem tc−coloring)
(:domain furniture−objects)
(:objects table chair c1 c2)
(:init (object table) (object chair) (can c1) (can c2) (inview table))
(:goal (color chair ?c) (color table ?c)))
Two action schemata: remove can lid to open and paint with open can
(:action remove−lid
:parameters (?x)
:precondition (can ?x)
:effect (open ?x))
(:action paint
:parameters (?x ?y)
:precondition (and (object ?x) (can ?y) (color ?y ?c) (open ?y))
:effect (color ?x ?c))
The paint action schema has a universal variable ?c ⇝ we cannot just give paint a color argument in a partially observable environment.
Sensorless Plan: Open one can, paint chair and table in its color.
Note: Contingent planning can create better plans, but needs perception
Two percept schemata: color of an object and color in a can
(:percept color
:parameters (?x ?c)
:precondition (and (object ?x) (inview ?x)))
(:percept can−color
:parameters (?x ?c)
:precondition (and (can ?x) (inview ?x) (open ?x)))
To perceive the color of an object, it must be in view; to perceive the color in a can, the can must additionally be open.
Note: In a fully observable world, the percepts would not have preconditions.
An action schema: looking at an object causes it to come into view.
(:action lookat
:parameters (?x)
:precondition (and (inview ?y) (notequal ?x ?y))
:effect (and (inview ?x) (not (inview ?y))))
Contingent Plan:
1. look at furniture to determine color, if same ; done.
2. else, open the cans (remove the lids) and look at the paint in them
3. if paint in one can is the same as an object, paint the other with this color
4. else paint both in any color
Conditional Plans
Definition 21.3.1. Conditional plans extend the possible actions in plans by conditional steps that execute sub-plans depending on whether K + P |= C, where K + P is the current knowledge base plus the percepts.
[Figure: search graph for the slippery vacuum world; the Suck and Right branches lead to GOAL and LOOP nodes ; OR graph with a loop.]
[L1 : left, if AtR then L1 else [if CleanL then ∅ else suck fi] fi] or
[while AtR do [left] done, if CleanL then ∅ else suck fi]
We have an infinite loop, but the plan eventually works unless the action always fails.
Problem: We do not know with certainty what state the world is in!
Idea: Just keep track of all the possible states it could be in.
Definition 21.4.1. A model based agent has a world model consisting of
a belief state that has information about the possible states the world may be
in, and
a sensor model that updates the belief state based on sensor information
a transition model that updates the belief state based on actions.
Idea: The agent environment determines what the world model can be.
In a fully observable, deterministic environment,
we can observe the initial state and subsequent states are given by the actions
alone.
thus the belief state is a singleton set (we call its member the world state) and
the transition model is a function from states and actions to states: a transition
function.
That is exactly what we have been doing until now: we have been studying methods that
build on descriptions of the “actual” world, and have been concentrating on the progression from
atomic to factored and ultimately structured representations. Tellingly, we spoke of “world states”
instead of “belief states”; we have now justified this practice in the brave new belief-based world
models by the (re-) definition of “world states” above. To fortify our intuitions, let us recap from
a belief-state-model perspective.
Let us now see what happens when we lift the restrictions of total observability and determin-
ism.
Note: This even applies to online problem solving, where we can just perceive
the state. (e.g. when we want to optimize utility)
In a deterministic, but partially observable environment,
the belief state must deal with a set of possible states.
we can use transition functions.
We need a sensor model, which predicts the influence of percepts on the belief
state – during update.
In a stochastic, partially observable environment,
mix the ideas from the last two. (sensor model + transition relation)
Decision-Theoretic Agents:
In a partially observable, stochastic environment
belief state + transition model ≙ decision networks,
inference ≙ maximizing expected utility.
Conformant/Sensorless Planning
Definition 21.5.1. Conformant or sensorless planning tries to find plans that work
without any sensing. (not even the initial state)
Example 21.5.2 (Sensorless Vacuum Cleaner World).
Observation 21.5.3. In a sensorless world we do not know the initial state. (or
any state after)
Observation 21.5.4. Sensorless planning must search in the space of belief states
(sets of possible actual states).
Example 21.5.5 (Searching the Belief State Space).
Start in {1, 2, 3, 4, 5, 6, 7, 8}.
Solution: [right, suck, left, suck]:
right → {2, 4, 6, 8}
suck → {4, 8}
left → {3, 7}
suck → {7}
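A minimal Python sketch of this belief-state computation. The encoding of the eight physical states and their numbering are chosen for this illustration so that the belief sets come out as in the example; they are not an official numbering from the notes.

# States are (location, dirt_left, dirt_right).
STATES = {
    1: ("L", True,  True),  2: ("R", True,  True),
    3: ("L", True,  False), 4: ("R", True,  False),
    5: ("L", False, True),  6: ("R", False, True),
    7: ("L", False, False), 8: ("R", False, False),
}
NUM = {v: k for k, v in STATES.items()}

def step(state, action):
    """Deterministic physical transition of the vacuum world."""
    loc, dl, dr = state
    if action == "right": return ("R", dl, dr)
    if action == "left":  return ("L", dl, dr)
    if action == "suck":  return (loc, False, dr) if loc == "L" else (loc, dl, False)
    raise ValueError(action)

def predict(belief, action):
    """Sensorless belief-state transition: apply the action to every possible state."""
    return {NUM[step(STATES[s], action)] for s in belief}

belief = set(STATES)                      # start in {1,...,8}: no information at all
for a in ["right", "suck", "left", "suck"]:
    belief = predict(belief, a)
    print(a, "->", sorted(belief))
# right -> [2, 4, 6, 8]; suck -> [4, 8]; left -> [3, 7]; suck -> [7]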
Let us see if we can understand the options for T^b(a, S) a bit better. The first question is when we want an action a to be applicable to a belief state S ⊆ S (the set of all states), i.e. when T^b(a, S) should be non-empty. In the first case, a^b would be applicable iff a is applicable to some s∈S, in the second case if a is applicable to all s∈S. So we only want to choose the first case if actions are harmless.
The second question we ask ourselves is what the result of applying a to S should be. Again, if actions are harmless, we can just collect the results; otherwise, we need to make sure that all members of the result a^b are reached for all possible states in S.
[Figure 3.3 (RN): The state space for the vacuum world. Links denote actions: L = Left, R = Right, S = Suck.]
Problem: Belief states are HUGE; e.g. the initial belief state for the 10 × 10 vacuum world contains 100 · 2^100 ≈ 10^32 physical states.
The update stage determines, for each possible percept, the resulting belief state:
UPDATE(b̂, o) := {s | o = PERC(s) and s ∈ b̂}
The functions PRED and PERC are the main parameters of this model. We define
RESULT(b, a) := {UPDATE(PRED(b, a), o) | o ∈ PossPERC(PRED(b, a))}
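A direct transcription of these functions into Python, as a sketch: the physical transition model `transition(s, a)` (returning a set of possible successors) and the percept function `percept(s)` are assumed to be given and are not defined in the notes.

def PRED(belief, action, transition):
    """Prediction stage: all states reachable by doing `action` in some state of `belief`."""
    return {s2 for s in belief for s2 in transition(s, action)}

def PossPERC(belief, percept):
    """Percepts that are possible in some state of `belief`."""
    return {percept(s) for s in belief}

def UPDATE(b_hat, o, percept):
    """Update stage: keep the states of the predicted belief state consistent with percept o."""
    return {s for s in b_hat if percept(s) == o}

def RESULT(belief, action, transition, percept):
    """All belief states that can result from doing `action` and then observing a percept."""
    b_hat = PRED(belief, action, transition)
    return [UPDATE(b_hat, o, percept) for o in PossPERC(b_hat, percept)]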
[Figure 4.14 (RN): Two examples of transitions in local-sensing vacuum worlds. (a) In the deterministic world, Right is applied in the initial belief state, resulting in a new predicted belief state with two possible physical states; for those states, the possible percepts are [R, Dirty] and [R, Clean], leading to two belief states, each of which is a singleton. (b) In the slippery world, Right is applied in the initial belief state, giving a new belief state with four physical states; for those states, the possible percepts are [L, Dirty], [R, Dirty], and [R, Clean], leading to three belief states as shown.]
(a) The action Right is deterministic, sensing disambiguates to singletons.
(b) Slippery World: The action Right is non-deterministic, sensing disambiguates somewhat.
Belief-State Search with Percepts
Observation: The belief-state transition model induces an AND–OR graph.
Idea: Use AND–OR search in nondeterministic environments.
[Figure 4.15 (RN): The first level of the AND–OR search tree for a problem in the local-sensing vacuum world; Suck is the first action in the solution.]
Solution: [Suck, Right, if Bstate = {6} then Suck else [] fi]
Contingent Planning
Definition 21.6.7. The generation of plans with conditional branching based on percepts is called contingent planning; its solutions are called contingent plans.
Appropriate for partially observable or non-deterministic environments.
Example 21.6.8. Continuing Example 21.2.1.
One of the possible contingent plans is
((lookat table) (lookat chair)
(if (and (color table c) (color chair c)) (noop)
((removelid c1) (lookat c1) (removelid c2) (lookat c2)
(if (and (color table c) (color can c)) ((paint chair can))
(if (and (color chair c) (color can c)) ((paint table can))
((paint chair c1) (paint table c1)))))))
Note: Variables in this plan are existential; e.g. in
line 2: If there is some joint color c of the table and chair ; done.
line 4/5: Condition can be satisfied by [c1 /can] or [c2 /can] ; instantiate ac-
cordingly.
Definition 21.6.9. During plan execution the agent maintains the belief state b,
chooses the branch depending on whether b |= c for the condition c.
Note: The planner must make sure b |= c can always be decided.
Here:
Given an action a and percepts p = p1 ∧ . . . ∧ pn , we have
Definition 21.7.2. The competitive ratio of an online problem solving agent is the quotient of
its online performance, i.e. the actual cost induced by online problem solving, and
its offline performance, i.e. the cost of an optimal solution with full information.
[Figure 4.18 (RN): A simple maze problem on a 3 × 3 grid. The agent starts at S and must reach G but knows nothing of the environment (i.e. Down in (1,1) results in (1,1) back).]
Observation 21.7.5. No online algorithm can avoid dead ends in all state spaces.
Example 21.7.6. Two state spaces that lead an online search agent into dead ends:
[Figure 4.19 (a) (RN): Two state spaces that might lead an online search agent into a dead end. Any given agent will fail in at least one of these spaces.]
Any given agent will fail in at least one of the spaces.
Definition 21.7.7. We call Example 21.7.6 an adversary argument.
Example 21.7.8. Forcing an online agent into an arbitrarily inefficient route:
[Figure 4.19 (b) (RN): A two-dimensional environment that can cause an online search agent to follow an arbitrarily inefficient route to the goal. Whichever choice the agent makes, the adversary blocks that route with another long, thin wall, so that the path followed is much longer than the best possible path.]
Whichever choice the agent makes, the adversary can block it with a long, thin wall.
Observation: Dead ends are a real problem for robots: ramps, stairs, cliffs, . . .
Definition 21.7.9. A state space is called safely explorable, iff a goal state is reachable from every reachable state.
Therefore, e.g. A∗ can expand any node in the fringe, but an online agent must go
there to explore it.
Intuition: It seems best to expand nodes in “local order” to avoid spurious travel.
Idea: Depth first search seems a good fit. (must only travel for backtracking)
Replanning for Plan Repair
Generally: Replanning when the agent's model of the world is incorrect.
Example 21.8.4 (Plan Repair by Replanning). Given a plan from S to G.
Figure 11.12 At first, the sequence “whole plan” is expected to get the agent from S to G.
The agent executes steps of the plan until it expects to be in state E, but observes that it is
actually in O. The agent then replans for the minimal repair plus continuation to reach G.
The agent executes wholeplan step by step, monitoring the rest of the plan.
After a few steps the agent expects to be in E, but observes state O.
Replanning: by calling the planner recursively
find state P in wholeplan and a plan repair from O to P . (P may be G)
minimize the cost of repair + continuation
Definition 21.8.5. There are three levels of execution monitoring: before executing
an action
action monitoring checks whether all preconditions still hold.
plan monitoring checks that the remaining plan will still succeed.
goal monitoring checks whether there is a better set of goals it could try to
achieve.
Note: Example 21.8.4 was a case of action monitoring leading to replanning.
Idea: On failure, resume planning (e.g. by POP) to achieve open conditions from
current state.
Definition 21.8.6. IPEM (Integrated Planning, Execution, and Monitoring):
Chapter 22
Semester Change-Over
Planning Frameworks
Planning Algorithms
Planning and Acting in the real world
[Figure 2.1 (RN): Agents interact with environments through sensors and actuators.]
Simple Reflex Agents
[Figure 2.9 (RN): Schematic diagram of a simple reflex agent: the sensors determine "what the world is like now", condition-action rules determine "what action I should do now", and the actuators execute it.]

function SIMPLE-REFLEX-AGENT(percept) returns an action
  persistent: rules, a set of condition–action rules
  state ← INTERPRET-INPUT(percept)
  rule ← RULE-MATCH(state, rules)
  action ← rule.ACTION
  return action

[Figure 2.10 (RN): A simple reflex agent. It acts according to a rule whose condition matches the current state, as defined by the percept.]

Reflex Agents with State
[Figure 2.12 (RN): A model-based reflex agent. It keeps track of the current state of the world using an internal model ("how the world evolves", "what my actions do"). It then chooses an action in the same way as the reflex agent.]

  state ← UPDATE-STATE(state, action, percept, model)
  rule ← RULE-MATCH(state, rules)
  action ← rule.ACTION
  return action

[Figure: an agent schema additionally containing "Goals" and "What it will be like if I do action A" (goal-based agent).]
[Figure 2.15 (RN): A general learning agent, with components: performance standard, critic (providing feedback), learning element (making changes, setting learning goals), performance element (knowledge), problem generator, sensors, and actuators.]
Rational Agent

Idea: Try to design agents that are successful. (do the right thing)

Definition 22.1.1. An agent is called rational, if it chooses whichever action maximizes the expected value of the performance measure given the percept sequence to date. This is called the MEU principle.

Note: A rational agent need not be perfect:
it only needs to maximize expected value (rational ̸= omniscient)
  it need not predict e.g. very unlikely but catastrophic events in the future
percepts may not supply all relevant information (rational ̸= clairvoyant)
  if we cannot perceive things, we do not need to react to them,
  but we may need to try to find out about hidden dangers (exploration)
action outcomes may not be as expected (rational ̸= successful)
  but we may need to take action to ensure that they do (more often) (learning)

Rational ; exploration, learning, autonomy
22.2 Administrativa
We will now go through the ground rules for the course. This is a kind of a social contract
between the instructor and the students. Both have to keep their side of the deal to make learning
as efficient and painless as possible.
Now we come to a topic that is always interesting to the students: the grading scheme.
Assessment, Grades
Academic Assessment: 90 minutes exam directly after courses end (∼ July 25
2023)
Retake Exam: 90 min exam directly after courses end the following semester (∼ Feb. 13 2024)
Module Grade:
Grade via the exam (Klausur) ; 100% of the grade
Results from “Übungen zu Künstliche Intelligenz” give up to 10% bonus to an
exam with ≥ 50% points. (not passed ; no bonus)
I do not think that this is the best possible scheme, but I have very little choice.
I basically do not have a choice in the grading scheme, as it is essentially the only one consistent with
university/state policies. For instance, I would like to give you more incentives for the homework
assignments – which would also mitigate the risk of having a bad day in the exam. Also, graded
quizzes would help you prepare for the lectures and thus let you get more out of them, but that
is also impossible.
Double Jeopardy : Homeworks only give 10% bonus points for the
exam, but without trying you are unlikely to pass the exam.
Admin: To keep things running smoothly
Homeworks will be posted on StudOn.
Sign up for AI-2 under https://www.studon.fau.de/crs4419186.html.
Homeworks are handed in electronically there. (plain text, program files, PDF)
Go to the tutorials, discuss with your TA! (they are there for you!)
Homework Discipline:
Start early! (many assignments need more than one evening’s work)
Don’t start by sitting at a blank screen (talking & study group help)
Humans will be trying to understand the text/code/math when grading it.
It is very well-established experience that without doing the homework assignments (or something
similar) on your own, you will not master the concepts, you will not even be able to ask sensible
questions, and take nothing home from the course. Just sitting in the course and nodding is not
enough! If you have questions please make sure you discuss them with the instructor, the teaching
assistants, or your fellow students. There are three sensible venues for such discussions: online in
the lecture, in the tutorials, which we discuss now, or in the course forum – see below. Finally, it
is always a very good idea to form study groups with your friends.
Approach: Weekly tutorials and homework assignments (first one in week two)
Goal 1: Reinforce what was taught in class. (you need practice)
Goal 2: Allow you to ask any question you have in a protected environment.
Instructor/Lead TA:
Florian Rabe (KWARC Postdoc)
Room: 11.137 @ Händler building, florian.rabe@fau.de
Also: Group submission has not worked well in the past! (too many freeloaders)
Do use the opportunity to discuss the AI-2 topics with others. After all, one of the non-trivial
skills you want to learn in the course is how to talk about Artificial Intelligence topics. And that
takes practice, practice, and practice. But what if you are not in a lecture or tutorial and want
to find out more about the AI-2 topics?
FAU has issued a very insightful guide on using lecture recordings. It is a good idea to heed these
recommendations, even if they seem annoying at first.
Using lecture recordings:
Attend lectures.
Catch up.
Due to the current AI hype, the course Artificial Intelligence is very popular and thus many
degree programs at FAU have adopted it for their curricula. Sometimes the course setup that fits the CS program does not fit the others very well; therefore there are some special conditions I want to state here.
I can only warn of what I am aware, so if your degree program lets you jump through extra hoops,
please tell me and then I can mention them here.
We restart the new semester by reminding ourselves of (the problems, methods, and issues of) Artificial Intelligence, and what has been achieved so far.
The first question we have to ask ourselves is “What is Artificial Intelligence?”, i.e. how can we
define it. And already that poses a problem since the natural definition “like human intelligence, but artificially realized” presupposes a definition of Intelligence, which is equally problematic; even
Psychologists and Philosophers – the subjects nominally “in charge” of human intelligence – have
problems defining it, as witnessed by the plethora of theories e.g. found at [WHI].
Maybe we can get around the problems of defining “what Artificial intelligence is”, by just de-
scribing the necessary components of AI (and how they interact). Let’s have a try to see whether
that is more informative.
The components of Artificial Intelligence are quite daunting, and none of them are fully un-
derstood, much less achieved artificially. But for some tasks we can get by with much less. And
indeed that is what the field of Artificial Intelligence does in practice – but keeps the lofty ideal
around. This practice of “trying to achieve AI in selected and restricted domains” (cf. the discus-
sion starting with slide 27) has borne rich fruits: systems that meet or exceed human capabilities
in such areas. Such systems are in common use in many domains of application.
in outer space: systems need autonomous control; remote control is impossible due to time lag.
in artificial limbs: the user controls the prosthesis via existing nerves, and can e.g. grip a sheet of paper.
in household appliances: the iRobot Roomba vacuums, mops, and sweeps in corners, . . . , parks, charges, and discharges; general robotic household help is on the horizon.
in hospitals: in the USA 90% of the prostate operations are carried out by RoboDoc; Paro is a cuddly robot that eases solitude in nursing homes.
The AI Conundrum
Observation: Reserving the term “Artificial Intelligence” has been quite a land
grab!
But: researchers at the Dartmouth Conference (1956) really thought they would solve/reach AI in two/three decades.
Consequence: AI still asks the big questions.
Another Consequence: AI as a field is an incubator for many innovative tech-
nologies.
Still Consequence: AI research was alternately flooded with money and cut off brutally.
There are currently three main avenues of attack to the problem of building artificially intelligent
systems. The (historically) first is based on the symbolic representation of knowledge about the
world and uses inference-based methods to derive new knowledge on which to base action decisions.
The second uses statistical methods to deal with uncertainty about the world state and learning
methods to derive new (uncertain) world assumptions to act on.
As a consequence, the field of Artificial Intelligence (AI) is an engineering field at the intersec-
tion of computer science (logic, programming, applied statistics), cognitive science (psychology,
neuroscience), philosophy (can machines think, what does that mean?), linguistics (natural lan-
guage understanding), and mechatronics (robot hardware, sensors).
Subsymbolic AI and in particular machine learning is currently hyped to such an extent, that
many people take it to be synonymous with “Artificial Intelligence”. It is one of the goals of this
course to show students that this is a very impoverished view.
We combine the topics in this way in this course, not only because this reproduces the historical
development but also as the methods of statistical and subsymbolic AI share a common basis.
It is important to notice that all approaches to AI have their application domains and strong points.
We will now see that exactly the two areas where symbolic AI and statistical/subsymbolic AI have their respective fortes correspond to natural application areas.
[Figure: precision of AI techniques; producer tasks require close to 100% precision.]
General Rule: Subsymbolic AI is well suited for consumer tasks, while symbolic
AI is better suited for producer tasks.
An example of a producer task – indeed this is where the name comes from – is the case of a
machine tool manufacturer T , which produces digitally programmed machine tools worth multiple
million Euro and sells them into dozens of countries. Thus T must also produce comprehensive machine operation manuals, a non-trivial undertaking, since no two machines are identical and they must be translated into many languages, leading to hundreds of documents. As those manuals share a lot of semantic content, their management should be supported by AI techniques. It is critical that these methods maintain a high precision, as operation errors can easily lead to very costly machine damage and loss of production. On the other hand, the domain of these manuals is quite restricted: a machine tool has only a couple of hundred components, which can be described by only a couple of thousand attributes.
Indeed companies like T employ high-precision AI techniques like the ones we will cover in this
course successfully; they are just not so much in the public eye as the consumer tasks.
We always try to find a topic at the intersection of your and our interests.
We also often have positions! (HiWi, Ph.D.: 1/2, PostDoc: full)
ascribed to the agent behaving rationally, i.e. optimizing the expected utility of its actions given
the (current) environment.
In the last semester we restricted ourselves to fully observable, deterministic, episodic environ-
ments, where optimizing utility is easy in principle – but may still be computationally intractable,
since we have full information about the world
An agent is an entity that perceives its environment through sensors and acts upon
that environment through actuators.
A rational agent is an agent maximizing its expected performance measure.
In AI-1 we dealt mainly with a logical approach to agent design (no uncertainty).
We ignored
interface to environment (sensors, actuators)
uncertainty
the possibility of self-improvement (learning)
This semester we want to alleviate all these restrictions and study rationality in more realistic circumstances, i.e. environments which need only be partially observable and where our actions can be non-deterministic. Both of these extensions conspire to allow us only partial knowledge about the world, so that we can only optimize “expected utility” instead of “actual utility” of our actions.
This directly leads to the first topic.
The second topic is motivated by the fact that environments can change and are initially unknown, and therefore the agent must obtain and/or update parameters like utilities and world knowledge by observing the environment.
The last topic (which we will only attack if we have time) is motivated by multi agent environments, where multiple agents have to collaborate for problem solving. Note that even the adversarial search methods discussed in chapter 9 were essentially single agent, as both opponents optimized the utility of their actions alone.
In true multi agent environments we have to also optimize collaboration between agents, and that is usually radically more efficient if agents can communicate.
Part V
This part of the course notes addresses inference and agent decision making in partially observable
environments, i.e. where we only know probabilities instead of certainties whether propositions
are true/false. We cover basic probability theory and – based on that – Bayesian Networks and
simple decision making in such environments. Finally we extend this to probabilistic temporal
models and their decision theory.
Chapter 23
Quantifying Uncertainty
In this chapter we develop a machinery for dealing with uncertainty: Instead of thinking about
what we know to be true, we must think about what is likely to be true.
Non-deterministic actions:
“When I try to go forward in this dark cave, I might actually go forward-left or
forward-right.”
Unreliable Sensors
Robot Localization: Suppose we want to support localization using landmarks
to narrow down the area.
Example 23.1.1. If you see the Eiffel tower, then you’re in Paris.
Difficulty: Sensors can be imprecise.
Even if a landmark is perceived, we cannot conclude with certainty that the
robot is at that location.
This is the half-scale Las Vegas copy, you dummy.
Even if a landmark is not perceived, we cannot conclude with certainty that the
robot is not at that location.
Top of Eiffel tower hidden in the clouds.
[Figure 2.1 (RN): Agents interact with environments through sensors and actuators.]
Different agents differ on the contents of the white box in the center.
Rationality

Idea: Try to design agents that are successful! (aka. "do the right thing")

Definition 23.1.4. A performance measure is a function that evaluates a sequence of environments.

Example 23.1.5. A performance measure for the vacuum cleaner world could
award one point per square cleaned up in time T ?
award one point per clean square per time step, minus one per move?
penalize for > k dirty squares?

Definition 23.1.6. An agent is called rational, if it chooses whichever action maximizes the expected value of the performance measure given the percept sequence to date.

Question: Why is rationality a good quality to aim for?

Consequences of Rationality: Exploration, Learning, Autonomy

Note: A rational agent need not be perfect.
Autonomy avoids fixed behaviors that can become unsuccessful in a changing environment. (anything else would be irrational)
The agent has to learn all relevant traits, invariants, and properties of the environment and its actions.
Environment types
23.1.3 Agent Architectures based on Belief States
A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/29041.
We are now ready to proceed to environments which can only be partially observed and where our actions are non-deterministic. Both sources of uncertainty conspire to allow us only partial knowledge about the world, so that we can only optimize “expected utility” instead of “actual utility” of our actions.
A model-based agent has a world model consisting of
a belief state that has information about the possible states the world may be
in, and
a sensor model that updates the belief state based on sensor information
a transition model that updates the belief state based on actions.
Idea: The agent environment determines what the world model can be.
That is exactly what we have been doing until now: we have been studying methods that
build on descriptions of the “actual” world, and have been concentrating on the progression from
atomic to factored and ultimately structured representations. Tellingly, we spoke of “world states”
instead of “belief states”; we have now justified this practice in the brave new belief-based world
models by the (re-) definition of “world states” above. To fortify our intuitions, let us recap from
a belief-state-model perspective.
Let us now see what happens when we lift the restrictions of total observability and determin-
ism.
Decision-Theoretic Agents:
In a partially observable, stochastic environment
belief state + transition model ≙ decision networks,
inference ≙ maximizing expected utility.
(deterministic) world states. Let us evaluate whether this is enough for them to survive in the
world.
Rational Agents:
We have a choice of actions (go to FRA early, go to FRA just in time).
These can lead to different solutions with different probabilities.
The actions have different costs.
The results have different utilities (safe timing/dislike airport food).
460 CHAPTER 23. QUANTIFYING UNCERTAINTY
A rational agent chooses the action with the maximum expected utility.
Decision Theory = Utility Theory + Probability Theory.
Utility-based agents
Definition 23.1.24. A utility based agent uses a world model along with a utility
function that models its preferences among the states of that world. It chooses the
action that leads to the best expected utility.
Agent Schema:
[Figure 2.14 (RN): A model-based, utility-based agent. It uses a model of the world, along with a utility function that measures its preferences among states of the world. Then it chooses the action that leads to the best expected utility, where expected utility is computed by averaging over all possible outcome states, weighted by the probability of the outcome.]
Decision-Theoretic Agent
Example 23.1.25 (A particular kind of utility-based agent).

function DT-AGENT(percept) returns an action
  persistent: belief state, probabilistic beliefs about the current state of the world
              action, the agent's action
  update belief state based on action and percept
  calculate outcome probabilities for actions, given action descriptions and current belief state
  select action with highest expected utility given probabilities of outcomes and utility information
  return action

[Figure 13.1 (RN): A decision-theoretic agent that selects rational actions.]
Bayes’ Rule: The basic insight about how to invert the “direction” of conditional probabilities.
Conditional Independence: How to capture and exploit complex relations between random variables?
This explains the difficulties arising when using Bayes’ rule on multiple pieces of evidence; conditional independence is used to ameliorate these difficulties.
Probabilistic Models
Definition 23.2.1. A probability theory is an assertion language for talking about
possible worlds and an inference method for quantifying the degree of belief in such
assertions.
Remark: Like logic, but for non-binary degrees of belief.
The possible worlds are mutually exclusive: different possible worlds cannot both
be the case and exhaustive: one possible world must be the case.
This determines the set of possible worlds.
Example 23.2.2. If we roll two (distinguishable) dice with six sides, then we have
36 possible worlds: (1,1), (2,1), . . . , (6,6).
We will restrict ourselves to a discrete, countable sample space. (others more
complicated, less useful in AI)
Convenience Notations:
By convention, we denote Boolean random variables with A, B, and more gen-
eral finite domain random variables with X, Y .
For a Boolean random variable Name, we write name for the outcome Name = T
and ¬name for Name = F. (Follows Russell/Norvig as well)
Probability Distributions
Definition 23.2.10. The probability distribution for a random variable X, written
P(X), is the vector of probabilities for the (ordered) domain of X.
Example 23.2.11. Probability distributions for finite domain and Boolean random
variables
Headache = T Headache = F
Weather = sunny P (W = sunny ∧ headache) P (W = sunny ∧ ¬headache)
Weather = rain
Weather = cloudy
Weather = snow
Definition 23.2.16.
Given random variables {X 1 , . . ., X n }, the full joint probability distribution, denoted
P(X 1 , . . ., X n ), lists the probabilities of all atomic events.
Observation:
Given random variables X 1 , . . ., X n with domains D1 , . . ., Dn , the full joint proba-
bility distribution is an n-dimensional array of size ⟨D1 , . . . ,Dn ⟩.
Example 23.2.17. P(Cavity, T oothache)
toothache ¬toothache
cavity 0.12 0.08
¬cavity 0.08 0.72
Note: All atomic events are disjoint (their pairwise conjunctions all are equivalent
to F ); the sum of all fields is 1 (the disjunction over all atomic events is T ).
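For concreteness, here is a minimal Python encoding of this full joint distribution as a dictionary indexed by atomic events; the representation is purely illustrative and will be reused in the sketches below.

# Full joint distribution P(Cavity, Toothache) from Example 23.2.17.
P = {
    (True,  True):  0.12,  # cavity, toothache
    (True,  False): 0.08,  # cavity, ¬toothache
    (False, True):  0.08,  # ¬cavity, toothache
    (False, False): 0.72,  # ¬cavity, ¬toothache
}
assert abs(sum(P.values()) - 1.0) < 1e-9   # atomic events are disjoint and exhaustive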
The role of clause 2 in Definition 23.2.18 is for P to “make sense”: intuitively, the probability
weight of a formula should be the sum of the weights of the interpretations satisfying it. Imagine
this was not so; then, for example, we could have P (A) = 0.2 and P (A ∧ B) = 0.8.
How to derive from (i), (ii’), and (iii) that, for all propositions A, P (¬a) = 1−P (a)?
Answer: reserved for the plenary sessions ; be there!
Believing in Kolmogorov?
Reminder 1: (i) P (⊤) = 1; (ii’) P (a ∨ b) = P (a) + P (b) − P (a ∧ b).
Reminder 2: “Probabilities model our belief.”
If P represents an objectively observable probability, the axioms clearly make
sense.
But why should an agent respect these axioms, when modeling its subjective
own belief?
Question: Do you believe in Kolmogorov’s axioms?
Answer: reserved for the plenary sessions ; be there!
you are informed that your current train has 30 minutes delay.
Example 23.3.2. The “probability of cavity” increases when the doctor is informed
that the patient has a toothache.
P (a|b) := P (a ∧ b)/P (b)
Intuition: The likelihood of having a and b, within the set of outcomes where we
have b.
Example 23.3.6. P (cavity ∧ toothache) = 0.12 and P (toothache) = 0.2 yield
P (cavity|toothache) = 0.6.
Headache = T Headache = F
Weather = sunny P (W = sunny|headache) P (W = sunny|¬headache)
Weather = rain
Weather = cloudy
Weather = snow
23.4 Independence
A Video Nugget covering this section can be found at https://fau.tv/clip/id/29050.
toothache ¬toothache
cavity 0.12 0.08
¬cavity 0.08 0.72
Answer: No:
Given n random variables with k values each, the full joint probability distribution contains k^n probabilities.
Computational cost of dealing with this size.
Practically impossible to assess all these probabilities.
Question: So, is there a compact way to represent the full joint probability distri-
bution? Is there an efficient method to work with that representation?
Answer: Not in general, but it works in many cases. We can work directly with
conditional probabilities, and exploit conditional independence.
Independence (Examples)
Example 23.4.4.
toothache ¬toothache
cavity 0.12 0.08
¬cavity 0.08 0.72
Adding variable Weather with values sunny, rain, cloudy, snow, the full joint prob-
ability distribution contains 16 probabilities.
But your teeth do not influence the weather, nor vice versa!
Weather is independent of each of Cavity and Toothache: For all value combi-
nations (c,t) of Cavity and Toothache, and for all values w of Weather, we have
P (c ∧ t ∧ w) = P (c ∧ t) · P (w).
P(Cavity, Toothache, Weather) can be reconstructed from the separate tables
P(Cavity, Toothache) and P(Weather). (8 probabilities)
The component wise array product from Definition 23.5.3 is something that Russell/Norvig (and
the literature in general) gloss over and sweep under the rug. The problem is that it is not a
real mathematical operator that can be defined notation-independently, because it depends on
the indices in the representation. But the notation is just too convenient to bypass.
It is just a coincidence that for independent random variables we can write the joint distribution
as an outer product P(X, Y) = P(X) · P(Y): here, the outer product and the component wise
array product coincide.
Marginalization
Extracting a sub-distribution from a larger joint distribution:
Given sets X and Y of random variables, we have:

P(X) = ∑_{y∈Y} P(X, y)

where ∑_{y∈Y} sums over all possible value combinations of Y.
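As an illustration (not part of the original notes), the following Python sketch represents the full joint distribution of Example 23.2.17 as a numpy array and marginalizes out one variable by summing over the corresponding axis:

    import numpy as np

    # Full joint distribution P(Cavity, Toothache) from Example 23.2.17,
    # indexed as [cavity, toothache] with index 0 = T and 1 = F.
    joint = np.array([[0.12, 0.08],
                      [0.08, 0.72]])

    # Marginalization: P(Cavity) = sum_t P(Cavity, t), i.e. sum out axis 1.
    p_cavity = joint.sum(axis=1)       # -> [0.2, 0.8]
    p_toothache = joint.sum(axis=0)    # -> [0.2, 0.8]
    print(p_cavity, p_toothache)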
We now come to a very important technique of computing unknown probabilities, which looks
almost like magic. Before we formally define it on the next slide, we will get an intuition by
considering it in the context of our dentistry example.
Normalization: Idea
Problem: We know P (cavity ∧ toothache) but don’t know P (toothache).
Step 1: Case distinction over values of Cavity: (P (toothache) as an unknown)
To understand what is going on, consider the joint distribution P(Cavity, Toothache) from above,
arranged as a table with rows cavity/¬cavity and columns toothache/¬toothache.
Normalization
Question: Say we know P(likeschappi ∧ dog) = 0.32 and P(¬likeschappi ∧ dog) =
0.08. Can we compute P(likeschappi|dog)? (Chappi ≙ a popular dog food)
Answer: reserved for the plenary sessions ; be there!
Question: So what is P (likeschappi|dog)?
Normalization: Formal
Definition 23.5.8.
Given a vector ⟨w1, . . ., wk⟩ of numbers in [0,1] where ∑_{i=1}^{k} wi ≤ 1, the normalization
constant α is α⟨w1, . . ., wk⟩ := (1/∑_{i=1}^{k} wi) · ⟨w1, . . ., wk⟩.
Note:
The condition ∑_{i=1}^{k} wi ≤ 1 is needed because these will be relative weights, i.e. a
case distinction over a subset of all worlds (the one fixed by the knowledge in our
conditional probability).
Example 23.5.9. α⟨0.12, 0.08⟩ = 5⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩.
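A small sketch (illustrative, using the same array representation as before) of how normalization recovers the conditional distribution of Example 23.5.9:

    import numpy as np

    def normalize(w):
        """Scale a vector of relative weights so that it sums to 1."""
        w = np.asarray(w, dtype=float)
        return w / w.sum()

    # alpha<0.12, 0.08> = 5*<0.12, 0.08> = <0.6, 0.4>, i.e. P(Cavity|toothache)
    print(normalize([0.12, 0.08]))   # -> [0.6 0.4]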
Bayes’ Rule
Definition 23.6.1 (Bayes’ Rule). Given propositions A and B where P (a) ̸= 0
and P (b) ̸= 0, we have:
P(a|b) = (P(b|a) · P(a))/P(b)
This equation is called Bayes’ rule.
Proof:
1. By definition, P(a|b) = P(a ∧ b)/P(b).
2. By the product rule, P(a ∧ b) = P(b|a) · P(a); substituting this yields the claim.
Notation: This is a system of equations!
Doctor d′ knows P(m|s) from observation; she does not need Bayes' rule!
Indeed, but what if a meningitis epidemic erupts?
Then d knows that P(m|s) grows proportionally with P(m), while d′ is clueless.
Conditional Independence
So we have the following network:
[Bayesian network: Cavity is the parent of both Toothache and Catch]
2. Normalization+Marginalization:
P(X|e) = α · P(X, e); if Y ≠ ∅ then P(X|e) = α · (∑_{y∈Y} P(X, e, y))
3. Chain rule:
Order X 1 = Cavity, X 2 = Toothache, X 3 = Catch.
Thus:
P(Cavity|toothache, catch)
= α · P(catch|Cavity) · P(toothache|Cavity) · P(Cavity)
= α · ⟨0.9 · 0.6 · 0.2, 0.2 · 0.1 · 0.8⟩
= α · ⟨0.108, 0.016⟩
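As an illustrative sketch (not from the notes; the CPT numbers 0.2, 0.6/0.1, and 0.9/0.2 are read off the example computation above), the same query in Python:

    # P(Cavity), P(toothache|Cavity), P(catch|Cavity) for Cavity in (True, False).
    p_cavity    = {True: 0.2, False: 0.8}
    p_toothache = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
    p_catch     = {True: 0.9, False: 0.2}   # P(catch | Cavity)

    # Unnormalized P(Cavity | toothache, catch) via the chain rule, then normalize.
    unnorm = {c: p_catch[c] * p_toothache[c] * p_cavity[c] for c in (True, False)}
    alpha = 1.0 / sum(unnorm.values())
    posterior = {c: alpha * v for c, v in unnorm.items()}
    print(posterior)   # {True: ~0.871, False: ~0.129}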
Observation 23.7.6.
In a naive Bayes model, the distribution of the cause given the observed effects can be
computed from the full joint probability distribution as

P(cause|effect1, . . ., effectn) = α · P(cause) · ∏ᵢ P(effectᵢ|cause)

Note: This kind of model is called "naive" since it is often used as a simplifying
model even if the effects are not conditionally independent after all.
It is also called idiot Bayes model by Bayesian fundamentalists.
In practice, naive Bayes models can work surprisingly well, even when the conditional
independence assumption is not true.
Example 23.7.7. The dentistry example is a (true) naive Bayes model.
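A minimal naive Bayes sketch (illustrative, not from the notes), reusing the dentistry numbers with cause = Cavity and effects = Toothache, Catch:

    def naive_bayes_posterior(prior, cond, observed):
        """prior: dict cause -> P(cause); cond: dict effect -> dict cause -> P(effect=T|cause);
        observed: dict effect -> bool. Returns the normalized posterior over causes."""
        unnorm = {}
        for c, p_c in prior.items():
            likelihood = 1.0
            for e, value in observed.items():
                p = cond[e][c]
                likelihood *= p if value else (1.0 - p)
            unnorm[c] = p_c * likelihood
        z = sum(unnorm.values())
        return {c: v / z for c, v in unnorm.items()}

    prior = {True: 0.2, False: 0.8}
    cond = {"toothache": {True: 0.6, False: 0.1}, "catch": {True: 0.9, False: 0.2}}
    print(naive_bayes_posterior(prior, cond, {"toothache": True, "catch": True}))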
Questionnaire
Consider the random variables X 1 = Animal, X 2 = LikesChappi, and X 3 =
LoudNoise, and X 1 has values {dog, cat, other}, X 2 and X 3 are Boolean.
Question: Which statements are correct?
(A) Animal is independent of LikesChappi.
(B) LoudNoise is independent of LikesChappi.
(C) Animal is conditionally independent of LikesChappi given LoudNoise.
(D) LikesChappi is conditionally independent of LoudNoise given Animal.
Think about this intuitively: given the two values of variable X, are the chances of Y
being true different for one of them (fixing the value of the third variable where specified)?
at Example 23.1.17 to understand whether logic was up to the job of guiding an agent in the
Wumpus cave.
Idea: Let's see whether our probabilistic reasoning machinery can help!
We split the set of hidden variables into fringe and other variables: U = F ∪ O
where F is the fringe and O the rest.
Corollary 23.8.3. P (b|P 1,3 , κ, U ) = P (b|P 1,3 , κ, F ) (by conditional
independence)
Now: let us exploit this formula.
Wumpus: Reasoning
We calculate:

P(P1,3|κ, b) = α (∑_{u∈U} P(P1,3, u, κ, b))
             = α (∑_{u∈U} P(b|P1,3, κ, u) · P(P1,3, κ, u))
             = α (∑_{f∈F} ∑_{o∈O} P(b|P1,3, κ, f, o) · P(P1,3, κ, f, o))
             = α (∑_{f∈F} P(b|P1,3, κ, f) · (∑_{o∈O} P(P1,3, κ, f, o)))
             = α (∑_{f∈F} P(b|P1,3, κ, f) · (∑_{o∈O} P(P1,3) · P(κ) · P(f) · P(o)))
             = α P(P1,3) P(κ) (∑_{f∈F} P(b|P1,3, κ, f) · P(f) · (∑_{o∈O} P(o)))
             = α′ P(P1,3) (∑_{f∈F} P(b|P1,3, κ, f) · P(f))
Wumpus: Solution
We calculate using the product rule and conditional independence (see above):

P(P1,3|κ, b) = α′ · P(P1,3) · (∑_{f∈F} P(b|P1,3, κ, f) · P(f))

Let us explore the possible models (values) f of the fringe F that are compatible with observation b.

P(P1,3|κ, b) = α′ · ⟨0.2 · (0.04 + 0.16 + 0.16), 0.8 · (0.04 + 0.16)⟩ ≈ ⟨0.31, 0.69⟩
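A small numeric check of this result (illustrative sketch, not from the notes): with pit probability 0.2 per square, the fringe-model weights 0.04, 0.16, 0.16 and 0.04, 0.16 used above arise as products of pit priors P(f).

    p = 0.2  # prior probability of a pit in any square

    # Fringe models compatible with the breeze, as products of pit priors.
    weights_pit    = [p * p, p * (1 - p), (1 - p) * p]   # 0.04, 0.16, 0.16
    weights_no_pit = [p * p, p * (1 - p)]                # 0.04, 0.16 (the two models used above)

    unnorm = [p * sum(weights_pit), (1 - p) * sum(weights_no_pit)]
    z = sum(unnorm)
    print([round(x / z, 2) for x in unnorm])   # -> [0.31, 0.69]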
23.9 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/29056.
Summary
Uncertainty is unavoidable in many environments, namely whenever agents do not
have perfect knowledge.
Probabilities express the degree of belief of an agent, given its knowledge, in an event.
Conditional probabilities express the likelihood of an event given observed evidence.
Assessing a probability ≙ using statistics to approximate the likelihood of an event.
Bayes’ rule allows us to derive, from probabilities that are easy to assess, probabil-
ities that aren’t easy to assess.
Given multiple evidence, we can exploit conditional independence.
24.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/29218.
[Bayesian network: Cavity is the parent of both Toothache and Catch]
2. Normalization+Marginalization:
P(X|e) = α · P(X, e) = α · ∑_{y∈Y} P(X, e, y)
Some Applications
A ubiquitous problem: Observe “symptoms”, need to infer “causes”.
Examples: medical diagnosis, face recognition.
Question: Given that both John and Mary call me, what is the probability of a
burglary?
Burglary: P(B) = .001        Earthquake: P(E) = .002

Alarm:   B  E  P(A)
         T  T  .95
         T  F  .94
         F  T  .29
         F  F  .001

JohnCalls:  A  P(J)        MaryCalls:  A  P(M)
            T  .90                     T  .70
            F  .05                     F  .01
Note:
In each P(X i |Parents(X i )), we show only P(X i = T|Parents(X i )). We don’t show
P(X i = F|Parents(X i )) which is 1 − P(X i = T|Parents(X i )).
[the burglary network with the CPTs shown above]
[network structure: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]
[generic network fragment: node X with parents U1, . . ., Um, children Y1, . . ., Yn, and the children's other parents Z1j, . . ., Znj]
[the burglary network structure again]
[CPTs for Alarm, JohnCalls, and MaryCalls as above]
Chain Rule:
For any ordering X1, . . ., Xn, we have:

P(X1, . . ., Xn) = ∏_{i=1}^{n} P(Xi|Xi−1, . . ., X1)

With Definition 24.3.3 (A), we can use P(Xi|Parents(Xi)) instead of P(Xi|Xi−1, . . ., X1):

P(X1, . . ., Xn) = ∏_{i=1}^{n} P(Xi|Parents(Xi))
Note:
If there is a cycle, then no ordering X1, . . ., Xn is consistent with the BN; so in the
chain rule on X1, . . ., Xn there comes a point where we have P(Xi|Xi−1, . . ., X1) in the chain but
P(Xi|Parents(Xi)) in the definition of the distribution, with Parents(Xi) ⊄ {Xi−1, . . ., X1} – and then
the products differ. So the chain rule can no longer be used to prove that we can reconstruct
the full joint probability distribution. In fact, cyclic Bayesian networks contain ambiguities (several
interpretations possible) and may be self-contradictory (no probability distribution matches the
Bayesian network).
[the burglary network with the CPTs shown above]
[network B: Animal is the parent of both LoudNoise and LikesChappi]
Say B is the Bayesian network above. Which statements are correct?
Note: For ?? we try to determine whether – given different value assignments to potential parents
– the probability of Xi being true differs? If yes, we include these parents. In the particular case:
Again: Given different value assignments to potential parents, does the probability of Xi being
true differ? If yes, include these parents.
1. M to J as before.
2. M, J to E as probability of E is higher if M/J is true.
3. Same for B; E to B because, given M and J are true, if E is true as well then prob of B is
lower than if E is false.
4. M /J/B/E to A because if M /J/B/E is true (even when changing the value of just one of
these) then probability of A is higher.
Note: size(B) ≙ the total number of entries in the CPTs.
In the worst case, size(B) = n · ∏_{i=1}^{n} #(Di), namely if every variable depends on
all its predecessors in the chosen order.
Intuition: BNs are compact if each variable is directly influenced only by few of
its predecessor variables.
If we model Fever as a noisy disjunction node, then the general rule
P(Xi|Parents(Xi)) = ∏_{j|Xj=T} qj for the CPT gives the following table:
[the burglary network with the CPTs shown above]
What is P(Burglary|johncalls)?
What is P(Burglary|johncalls, marycalls)?
2. Normalization+Marginalization:
P(X|e) = α · P(X, e); if Y ≠ ∅ then P(X|e) = α · (∑_{y∈Y} P(X, e, y))
3. Chain Rule:
Order X 1 , . . ., X n consistent with B.
[CPTs for Alarm, JohnCalls, and MaryCalls as above]
Order: X 1 = B, X 2 = E, X 3 = A, X 4 = J, X 5 = M .
Note: This step is actually done implicitly by the pseudo-code: in the recursive calls to
enumerate-all we multiply our own probability with all the rest. That is valid because,
the variable ordering being consistent, all our parents occur earlier in the order – which is
just another way of saying "my own probability does not depend on the variables in the rest
of the order". The probabilities of the outer variables multiply the entire "rest of the sum".
Chain rule and conditional independence, ctd.:

P(B|j, m) = α · P(B) · (∑_{vE} P(vE) · (∑_{vA} P(vA|B, vE) · P(j|vA) · P(m|vA)))

For B = b this expands to

P(b|j, m) = α · P(b) · ( P(e) · (P(a|b, e) · P(j|a) · P(m|a) + P(¬a|b, e) · P(j|¬a) · P(m|¬a))
                       + P(¬e) · (P(a|b, ¬e) · P(j|a) · P(m|a) + P(¬a|b, ¬e) · P(j|¬a) · P(m|¬a)))

P(B|j, m) = α · ⟨0.00059224, 0.0014919⟩ ≈ ⟨0.284, 0.716⟩
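The same computation as a runnable sketch (illustrative; this is not the enumeration pseudo-code of [RN03], but a direct enumeration over the hidden variables with the CPT values of the network above):

    import itertools

    # CPTs of the burglary network (probabilities of the variable being True).
    P_B = 0.001
    P_E = 0.002
    P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
    P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = T | Alarm)
    P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = T | Alarm)

    def p(prob_true, value):
        return prob_true if value else 1.0 - prob_true

    def joint(b, e, a, j, m):
        return p(P_B, b) * p(P_E, e) * p(P_A[(b, e)], a) * p(P_J[a], j) * p(P_M[a], m)

    # Enumerate the hidden variables E and A for the query P(Burglary | j, m).
    unnorm = {b: sum(joint(b, e, a, True, True)
                     for e, a in itertools.product([True, False], repeat=2))
              for b in (True, False)}
    z = sum(unnorm.values())
    print({b: v / z for b, v in unnorm.items()})   # approx {True: 0.284, False: 0.716}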
Inference by enumeration = a tree with “sum nodes” branching over values of hidden
variables, and with non-branching “multiplication nodes”.
2. Then the computation is performed in terms of factor product and summing out
variables from factors:

P(B|j, m) = α · f1(B) · (∑_{vE} f2(E) · (∑_{vA} f3(A, B, E) · f4(A) · f5(A)))
So?: Life goes on . . . In the hard cases, if need be we can throw exactitude to
the winds and approximate.
Example 24.5.8. Sampling techniques as in MCTS.
24.6 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/29228.
Summary
Bayesian networks (BN) are a wide-spread tool to model uncertainty and to reason about it.
Reading:
• Chapter 14: Probabilistic Reasoning of [RN03].
– Section 14.1 roughly corresponds to my “What is a Bayesian Network?”.
– Section 14.2 roughly corresponds to my "What is the Meaning of a Bayesian Network?" and
"Constructing Bayesian Networks". The main change I made here is to define the semantics
of the BN in terms of the conditional independence relations, which I find clearer than RN's
definition that uses the reconstructed full joint probability distribution instead.
– Section 14.4 roughly corresponds to my “Inference in Bayesian Networks”. RN give full details
on variable elimination, which makes for nice ongoing reading.
– Section 14.3 discusses how CPTs are specified in practice.
25.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/30338.
Decision Theory
Definition 25.1.1. Decision theory investigates decision problems, i.e. how an
agent a deals with choosing among actions based on the desirability of their out-
comes given by a real-valued utility function u on states s∈S: i.e. u : S→R.
fully observable, iff A's sensors give it access to the complete state of the
environment at any point in time, else partially observable.
deterministic, iff the next state of the environment is completely determined by
the current state and A's action, else stochastic.
episodic, iff A's experience is divided into atomic episodes, where it perceives
and then performs a single action. Crucially, the next episode does not depend
on previous ones. Non-episodic environments are called sequential.
For now: We restrict ourselves to episodic decision theory, which deals with
choosing among actions based on the desirability of their immediate outcomes. (no
need to treat time explicitly)
Later:
We will study sequential decision problems, where the agent’s utility depends on a
sequence of decisions. (chapter 27)
Utility-based agents
Definition 25.1.2. A utility based agent uses a world model along with a utility
function that models its preferences among the states of that world. It chooses the
action that leads to the best expected utility.
Agent Schema:
[figure (AIMA Fig. 2.14): A model-based, utility-based agent. It uses a model of the world, along with
a utility function that measures its preferences among states of the world. Then it chooses the
action that leads to the best expected utility, where expected utility is computed by averaging
over all possible outcome states, weighted by the probability of the outcome.]
Decision networks
Value of information
Example 25.2.1. I have to decide whether to go to class today (or sleep in). What
is the utility of this lecture? (obviously 42)
Idea: We can let people/agents choose between two states! (subjective
preference)
Example 25.2.2. Give me your cell-phone or I will give you a bloody nose. ⇝
To make a decision in a deterministic environment, the agent must determine
whether it prefers a state without a phone to one with a bloody nose.
Definition 25.2.3.
Given states A and B (we call them prizes), an agent can express preferences of
the form A≻B (A preferred to B), A∼B (indifference between A and B), or B≻A.
Definition 25.2.5.
Given prizes Ai and probabilities pi with ∑_{i=1}^{n} pi = 1, a lottery [p1,A1; . . .; pn,An] represents
the result of a nondeterministic action that can have outcomes Ai with prior probability pi.
For the binary case, we use [p,A; 1−p,B].
[diagram: a lottery L with outcome A with probability p and outcome B with probability 1−p]
Rational Preferences
Idea: Preferences of a rational agent must obey constraints:
Rational preferences ⇝ behavior describable as maximization of expected utility.
Definition 25.2.6. We call a set ≻ of preferences rational, iff the following con-
straints hold:
Orderability A≻B ∨ B≻A ∨ A∼B
Transitivity A≻B ∧ B≻C ⇒ A≻C
Continuity A≻B≻C ⇒ (∃p [p,A;1−p,C]∼B)
Substitutability A∼B ⇒ [p,A;1−p,C]∼[p,B;1−p,C]
Monotonicity A≻B ⇒ (p>q) ⇔ [p,A;1−p,B]≻[q,A;1−q,B]
Decomposability [p,A;1−p,[q,B;1−q,C]]∼[p,A ; ((1 − p)q),B ; ((1 − p)(1 − q)),C]
Orderability: A≻B ∨ B≻A ∨ A∼B Given any two prizes or lotteries, a rational agent must either prefer one
to the other or else rate the two as equally preferable. That is, the agent cannot avoid deciding.
Refusing to bet is like refusing to allow time to pass.
Continuity: A≻B≻C ⇒ (∃p [p,A;1−p,C]∼B) If some lottery B is between A and C in preference, then there
is some probability p for which the rational agent will be indifferent between getting B for sure
and the lottery that yields A with probability p and C with probability 1 − p.
Substitutability: A∼B ⇒ [p,A;1−p,C]∼[p,B;1−p,C] If an agent is indifferent between two lotteries A and B, then
the agent is indifferent between two more complex lotteries that are the same except that B
is substituted for A in one of them. This holds regardless of the probabilities and the other
outcome(s) in the lotteries.
Monotonicity: A≻B ⇒ (p>q) ⇔ [p,A;1−p,B]≻[q,A;1−q,B] Suppose two lotteries have the same two possible
outcomes, A and B. If an agent prefers A to B, then the agent must prefer the lottery that has
a higher probability for A (and vice versa).
[diagram: the compound lottery [p,A; 1−p,[q,B; 1−q,C]] and the equivalent flat lottery [p,A; (1−p)q,B; (1−p)(1−q),C]]
Observation: With deterministic prizes only (no lottery choices), only a total
ordering on prizes can be determined.
Definition 25.3.2. We call a total ordering on states a value function or ordinal
utility function.
Idea: Apply this idea to get the expected utility of an action – which is stochastic:
in partially observable environments, we do not know the current state;
in nondeterministic environments, we cannot be sure of the result of an action.
Definition 25.3.4. Let A be an agent with a set Ω of states and a utility function
U : Ω→R+ 0 , then for each action a, we define a random variable Ra whose values
are the results of performing a in the current state.
Definition 25.3.5. The expected utility EU(a|e) of an action a (given evidence e) is

EU(a|e) := ∑_{s∈Ω} P(Ra = s|a, e) · U(s)
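A minimal sketch of MEU action selection under these definitions (illustrative; the outcome distributions and utilities below are made up):

    def expected_utility(outcome_dist, utility):
        """outcome_dist: dict state -> P(R_a = state | a, e); utility: dict state -> U(state)."""
        return sum(p * utility[s] for s, p in outcome_dist.items())

    def meu_action(actions, utility):
        """actions: dict action -> outcome distribution. Returns the action maximizing EU."""
        return max(actions, key=lambda a: expected_utility(actions[a], utility))

    # Hypothetical example: two actions over states s1, s2.
    utility = {"s1": 10.0, "s2": 2.0}
    actions = {"a1": {"s1": 0.3, "s2": 0.7}, "a2": {"s1": 0.6, "s2": 0.4}}
    print(meu_action(actions, utility))   # -> "a2"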
Utilities
Intuition: Utilities map states to real numbers.
Question: Which numbers exactly?
Measuring Utility
Definition 25.3.8. Normalized utilities: u⊤ = 1, u⊥ = 0.
Ask them directly: What would you pay to avoid playing Russian roulette with
a million-barrelled revolver? (very large numbers)
But their behavior suggests a lower price:
Driving in a car for 370 km incurs a risk of one micromort;
over the life of your car – say, 150,000 km – that's 400 micromorts.
People appear to be willing to pay about 10,000€ more for a safer car that
halves the risk of death. (⇝ 25€ per micromort)
This figure has been confirmed across many individuals and risk types.
Of course, this argument holds only for small risks. Most people won’t agree to
kill themselves for 25M€.
Definition 25.3.10. QALYs: quality adjusted life years
Application: QALYs are useful for medical decisions involving substantial risk.
Typical empirical data, extrapolated with risk-prone behavior for debtors:
Strict Dominance
Typically define attributes such that U is monotone in each argument. (wlog.
growing)
Definition 25.4.3. Choice B strictly dominates choice A iff Xi (B) ≥ Xi (A) for
all i (and hence U (B)≥U (A))
Stochastic Dominance
Definition 25.4.4.
A distribution p2 stochastically dominates distribution p1 iff the cumulative distribution of p1
is everywhere at most that of p2, i.e. for all t:

∫_{−∞}^{t} p1(x)dx ≤ ∫_{−∞}^{t} p2(x)dx

Example 25.4.5. [figure: two distributions and their cumulative distributions illustrating stochastic dominance]
Example 25.4.7.
Construction cost increases with distance from the city; S1 is closer to the city than S2
⇝ S1 stochastically dominates S2 on cost.
Example 25.4.8. Injury increases with collision speed.
Idea: Annotate Bayesian networks with stochastic dominance information.
Definition 25.4.9.
+
X →Y (X positively influences Y ) means that P(Y |X 1 , z) stochastically dominates
P(Y |X 2 , z) for every value z of Y ’s other parents Z and all X 1 and X 2 with
X 1 ≥X 2 .
U(X1, . . ., Xn) = F(f1(X1), . . ., fn(Xn))
U = k 1 U 1 +k 2 U 2 +k 3 U 3 +k 1 k 2 U 1 U 2 +k 2 k 3 U 2 U 3 +k 3 k 1 U 3 U 1 +k 1 k 2 k 3 U 1 U 2 U 3
System Support: Routine procedures and software packages for generating pref-
erence tests to identify various canonical families of utility functions.
[figure: the model-based, utility-based agent schema from above (AIMA Fig. 2.14)]
As we already use Bayesian networks for the world/belief model, integrating utilities and possible
actions into the network suggests itself naturally. This leads to the notion of a decision
network.
Decision networks
Definition 25.5.1. A decision network is a Bayesian network with added action
nodes and utility nodes (also called value nodes) that enable decision making.
Algorithm:
For each value of action node
compute expected value of utility node given action, evidence
Return MEU action (via argmax)
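A sketch of this algorithm (illustrative; the evaluation of the utility node is abstracted into a function that you would implement with the BN inference machinery above):

    def best_action(action_values, evidence, expected_utility):
        """action_values: possible values of the action node;
        expected_utility(a, evidence): expected value of the utility node given action a.
        Returns the MEU action via argmax."""
        return max(action_values, key=lambda a: expected_utility(a, evidence))

    # Hypothetical usage with two candidate actions and made-up expected utilities.
    def eu(action, evidence):
        return {"site1": 42.0, "site2": 17.0}[action]

    print(best_action(["site1", "site2"], {}, eu))   # -> "site1"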
So far we have tacitly been concentrating on actions that directly affect the environment. We
will now come to a type of action we have hypothesized in the beginning of the course, but have
completely ignored up to now: information acquisition actions.
So, we should pay up to k/n€ for the information. (as much as block 3 is
worth)
Idea: So we must compute the expected gain over all possible values f∈D.
Definition 25.6.5. Let F be a random variable with domain D, then the value of
perfect information (VPI) on F given evidence E is defined as

VPI_E(F) := (∑_{f∈D} P(F = f|E) · EU(α_f|E, F = f)) − EU(α|E)
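A direct transcription of this definition (illustrative sketch; α and α_f denote the MEU actions before and after observing F = f, computed with a function like best_action above; the numbers are made up):

    def vpi(values_of_F, p_f_given_e, eu_after, eu_current_best):
        """values_of_F: domain D of F; p_f_given_e[f]: P(F=f|E);
        eu_after(f): EU of the best action after additionally observing F=f;
        eu_current_best: EU of the best action under evidence E alone."""
        expected_after = sum(p_f_given_e[f] * eu_after(f) for f in values_of_F)
        return expected_after - eu_current_best

    # Hypothetical numbers: observing F is worth 0.7*10 + 0.3*4 - 6 = 2.2 utility units.
    print(vpi(["hi", "lo"], {"hi": 0.7, "lo": 0.3}, lambda f: {"hi": 10, "lo": 4}[f], 6.0))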
Properties of VPI
Note:
When more than one piece of evidence can be gathered, maximizing VPI for each
piece to select one is not always optimal
⇝ evidence-gathering becomes a sequential decision problem.
We will now use information value theory to specialize our utility-based agent from above.
Particle filtering?
Further algorithms and Topics?
Definition 26.1.4 (Basic Setup). A temporal probability model has two sets of
random variables indexed by N.
Xt ≙ set of (unobservable) state variables at time t ≥ 0,
e.g., BloodSugart, StomachContentst, etc.
Et ≙ set of (observable) evidence variables at time t > 0,
e.g., MeasuredBloodSugart, PulseRatet, FoodEatent
Notation: Xa:b = Xa , Xa+1 , . . . , Xb−1 , Xb
Markov Processes
Idea: Construct a Bayesian network from these variables. (parents?)
Definition 26.1.8. We say that a Markov process has the nth order Markov
property for n∈N⁺, iff P(Xt|X0:t−1) = P(Xt|Xt−n:t−1).
Special Cases:
First-order Markov property: P(Xt|X0:t−1) = P(Xt|Xt−1)
Intuition: Increasing the order adds “memory” to the process, Markov chains have
none.
Preview: We will use Markov processes to model sequential environments.
Possible fixes:
1. Increase the order of the Markov process. (more dependencies)
2. Add state variables, e.g., add Tempt , Pressuret . (more information sources)
We will see the second in another example: tracking robot motion.
[figure: network fragment over variables Vt, Xt, Zt replicated for t−1, t, t+1]
Idea: We can restore the Markov property by including a state variable for the
charge level Bt . (Better still: Battery level sensor)
[figure: the fragment extended with variables Mt and Bt in addition to Vt, Xt, Zt]
Problem: Even with Markov property the transition model is infinite. (t∈N)
Definition 26.1.14. A Markov chain is called stationary if the transition model is
independent of time, i.e. P(Xt |Xt−1 ) is the same for all t.
Example 26.1.15 (Umbrellas are stationary). P(Rt |Rt−1 ) does not depend on
t. (need only one table)
Inference tasks
Definition 26.2.1. The Markov inference tasks consist of filtering, prediction,
smoothing, and most likely explanation, as defined below.
Definition 26.2.2. Filtering (or monitoring): P(Xt |e1:t )
computing the belief state input to the decision process of a rational agent.
Definition 26.2.3. Prediction (or state estimation): P(Xt+k|e1:t) for k > 0:
evaluation of possible action sequences. (≙ filtering without the evidence)
Definition 26.2.4. Smoothing (or hindsight): P(Xk |e1:t ) for 0 ≤ k < t
better estimate of past states. (essential for learning)
Definition 26.2.5. Most likely explanation: argmax_{x1:t} P(x1:t|e1:t)
speech recognition, decoding with a noisy channel.
Note: P(et+1 |Xt+1 ) can be obtained directly from the sensor model.
Continue by conditioning on the current state Xt :
P(Xt+1|e1:t+1)
  = α · P(et+1|Xt+1) · (∑_{xt} P(Xt+1|xt, e1:t) · P(xt|e1:t))
  = α · P(et+1|Xt+1) · (∑_{xt} P(Xt+1|xt) · P(xt|e1:t))
P(Xt+1 |Xt ) is simply the transition model, P (xt |e1:t ) the “recursive call”.
So f 1:t+1 = α · FORWARD(f 1:t , et+1 ) where f 1:t = P(Xt |e1:t ) and FORWARD is
the update shown above. (Time and space constant (independent of t))
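A compact sketch of this FORWARD update for the umbrella model (illustrative; the transition value 0.7 and the sensor values 0.9/0.2 match the numbers used in Example 26.2.9 below):

    import numpy as np

    T = np.array([[0.7, 0.3],     # row: previous state (rain, ¬rain)
                  [0.3, 0.7]])    # column: next state
    O_umbrella = np.diag([0.9, 0.2])    # sensor model for evidence "umbrella"

    def forward(f, O):
        """One filtering step: f_{1:t+1} = alpha * O_{t+1} * T^T * f_{1:t}."""
        f_new = O @ T.T @ f
        return f_new / f_new.sum()

    f = np.array([0.5, 0.5])            # prior P(R_0)
    f = forward(f, O_umbrella)          # day 1, umbrella observed
    print(f)                            # -> approx [0.818, 0.182]
    f = forward(f, O_umbrella)          # day 2, umbrella observed
    print(f)                            # -> approx [0.883, 0.117]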
Proof sketch: Using the same reasoning as for the FORWARD algorithm for filter-
ing.
Observation 26.2.8. As k → ∞, P(xt+k|e1:t) tends to the stationary distribution
of the Markov chain, i.e. a fixed point under prediction.
Intuition: The mixing time, i.e. the time until prediction reaches the stationary
distribution depends on how “stochastic” the chain is.
Smoothing
Smoothing estimates past states by computing P(Xk |e1:t ) for 0 ≤ k < t
Divide evidence e1:t into e1:k (before k) and ek+1:t (after k):
Smoothing (continued)
Backward message bk+1:t = P(ek+1:t|Xk) computed by a backwards recursion:

P(ek+1:t|Xk) = ∑_{xk+1} P(ek+1:t|Xk, xk+1) · P(xk+1|Xk)
             = ∑_{xk+1} P(ek+1:t|xk+1) · P(xk+1|Xk)
             = ∑_{xk+1} P(ek+1, ek+2:t|xk+1) · P(xk+1|Xk)
             = ∑_{xk+1} P(ek+1|xk+1) · P(ek+2:t|xk+1) · P(xk+1|Xk)
P (ek+1 |xk+1 ) and P(xk+1 |Xk ) can be directly obtained from the model, P (ek+2:t |xk+1 )
is the “recursive call” (bk+2:t ).
In message notation: bk+1:t = BACKWARD(bk+2:t , ek+1:t ) where BACKWARD
is the update shown above. (time and space constant (independent of t))
Smoothing example
Example 26.2.9 (Smoothing Umbrellas). Umbrella appears on days 1/2.
P(R1|u1, u2) = α · P(R1|u1) · P(u2|R1) = α · ⟨0.818, 0.182⟩ · P(u2|R1)
Compute P(u2|R1) by backwards recursion:

P(u2|R1) = ∑_{r2} P(u2|r2) · 1 · P(r2|R1)
         = 0.9 · 1 · ⟨0.7, 0.3⟩ + 0.2 · 1 · ⟨0.3, 0.7⟩ = ⟨0.69, 0.41⟩

(the factor 1 is the backward message for the empty evidence sequence after day 2)
Time complexity linear in t (polytree inference), Space complexity O(t · #(f )).
Most likely path to each xt+1 = most likely path to some xt plus one more step.
I.e., m1:t(i) gives the probability of the most likely path to state i.
The update has the sum replaced by max, giving the Viterbi algorithm:

m1:t+1 = P(et+1|Xt+1) · max_{xt} (P(Xt+1|xt) · m1:t)

Observation 26.2.13. Viterbi has linear time complexity (like filtering), but also linear
space complexity (it needs to keep a pointer to the most likely sequence leading to each
state).
Viterbi example
Example 26.2.14 (Viterbi for Umbrellas). View the possible state sequences for
Raint as paths through state graph.
To find the "most likely sequence", follow the bold arrows back from the most likely state in m1:5.
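A sketch of the Viterbi update for the umbrella model (illustrative; same transition and sensor numbers as in the filtering sketch above):

    import numpy as np

    T = np.array([[0.7, 0.3], [0.3, 0.7]])      # P(X_{t+1} | X_t)
    O = {True: np.array([0.9, 0.2]),             # P(umbrella | X)
         False: np.array([0.1, 0.8])}            # P(¬umbrella | X)

    def viterbi(evidence, prior=np.array([0.5, 0.5])):
        """Return the most likely state sequence (True = rain) for the given evidence list."""
        m = O[evidence[0]] * (T.T @ prior)        # m_{1:1}
        backptrs = []
        for e in evidence[1:]:
            trans = T * m[:, None]                # trans[i, j] = P(x_j | x_i) * m(i)
            backptrs.append(trans.argmax(axis=0))
            m = O[e] * trans.max(axis=0)
        states = [int(m.argmax())]                # follow pointers back from the best final state
        for bp in reversed(backptrs):
            states.append(int(bp[states[-1]]))
        return [s == 0 for s in reversed(states)]  # index 0 = rain

    print(viterbi([True, True, False, True, True]))  # -> [True, True, False, True, True]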
HMM Algorithm
Idea: The forward and backward messages are column vectors in HMMs.
For instance, the probability that the sensor on a square with obstacles in north
and south would produce NSE is (1−ε)³ · ε¹.
Idea: Use the HMM filtering equation f1:t+1 = α · Ot+1 T^⊤ f1:t for localization. (next)
Idea: Use the HMM filtering equation f1:t+1 = α · Ot+1 T^⊤ f1:t to compute the posterior
distribution over locations. (i.e. robot localization)
[figure: localization error and Viterbi path accuracy as a function of the number of observations, for per-bit sensor error rates ε = 0.20, 0.10, 0.05, 0.02, 0.00]
[textbook excerpt on HMM robot localization (AIMA Ch. 15) omitted]

Country dance algorithm

Idea: We can avoid storing all forward messages in smoothing by running the forward algorithm backwards:

f1:t+1 = α · Ot+1 T^⊤ f1:t
Ot+1⁻¹ f1:t+1 = α · T^⊤ f1:t
α′ · (T^⊤)⁻¹ Ot+1⁻¹ f1:t+1 = f1:t

Algorithm: Forward pass computes f1:t, backward pass does f1:i, bt−i:t.
Observation: The backward pass only stores one copy of f1:i, bt:t−i ⇝ constant space.
Problem: The algorithm is severely limited: the transition matrix must be invertible and
the sensor matrix cannot have zeroes – that is, every observation must be possible in
every state.
Definition 26.4.5 (Naive method). Unroll the network and run any exact algo-
rithm.
Summary
Temporal probability models use state and evidence variables replicated over time.
Markov property and stationarity assumption, so we need both
a transition model P(Xt|Xt−1) and
a sensor model P(Et|Xt).
Tasks are filtering, prediction, smoothing, most likely sequence; (all done
recursively with constant cost per time step)
Hidden Markov models have a single discrete state variable; (used for speech
recognition)
DBNs subsume HMMs, exact update intractable.
Particle filtering is a good approximate filtering algorithm for DBNs.
A Video Nugget covering the introduction to this chapter can be found at https://fau.tv/
clip/id/30356. We will now pick up the thread from chapter 25, but using temporal models
instead of simply probabilistic ones. We will first look at sequential decision theory in the special
case where the environment is stochastic but fully observable (Markov decision processes), and
then lift that to obtain POMDPs and present an agent design based on that.
Outline
Markov decision processes (MDPs) for sequential environments.
[diagram relating search (explicit actions and subgoals) to MDPs (uncertainty and utility)]
Solving MDPs
Recall: In search problems, the aim is to find an optimal sequence of actions.
In MDPs, the aim is to find an optimal policy π(s), i.e., the best action for every
possible state s. (because we can't predict where we will end up)
Note: When you run against a wall, you stay in your square.
[figure: optimal policies for the 4×3 world for different reward ranges:
R(s) < −1.6284;  −0.4278 < R(s) < −0.0850;  −0.0221 < R(s) < 0;  R(s) > 0]
Question: Explain what you see in a qualitative manner!
Answer: reserved for the plenary sessions ; be there!
Theorem 27.2.2. For stationary preferences, there are only two ways to combine
rewards over time.
additive rewards: U ([s0 , s1 , s2 , . . .]) = R(s0 ) + R(s1 ) + R(s2 ) + · · ·
discounted rewards: U([s0, s1, s2, . . .]) = R(s0) + γR(s1) + γ²R(s2) + · · · where
γ is called the discount factor.
Possible Solutions:
1. Finite horizon: terminate utility computation at a fixed time T
Utility of States
Intuition: Utility of a state ≙ expected (discounted) sum of rewards (until
termination) assuming optimal actions.
Definition 27.2.5. Given a policy π, let st be the state the agent reaches at time
t starting at state s0. Then the expected utility obtained by executing π starting in
s is given by

U^π(s) := E[∑_{t=0}^{∞} γᵗ R(st)]

We define π*_s := argmax_π U^π(s).
Question: Why do we go left in (3, 1) and not up? (follow the utility)
expected sum of rewards = current reward + γ · exp. reward sum after best action
function VALUE-ITERATION(mdp, ε) returns a utility function
  inputs: mdp, an MDP with states S, actions A(s), transition model P(s′|s, a),
          rewards R(s), discount γ
          ε, the maximum error allowed in the utility of any state
  local variables: U, U′, vectors of utilities for states in S, initially zero
                   δ, the maximum change in the utility of any state in an iteration
  repeat
    U ← U′; δ ← 0
    for each state s in S do
      U′[s] ← R(s) + γ · max_{a∈A(s)} ∑_{s′} P(s′|s, a) · U[s′]
      if |U′[s] − U[s]| > δ then δ ← |U′[s] − U[s]|
  until δ < ε(1 − γ)/γ
  return U

Remark: Retrieve the optimal policy with π[s] := argmax_a (∑_{s′} U[s′] · P(s′|s, a))

Value Iteration Algorithm (Example)
[figure: (a) utility estimates for the states (4,3), (3,3), (1,1), (3,1), (4,1) as a function of the number of iterations; (b) number of iterations required (for c = 0.0001, 0.001, 0.01, 0.1) as a function of the discount factor]
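A compact value iteration sketch for a generic finite MDP (illustrative; the tiny two-state MDP at the bottom is made up just to exercise the code):

    def value_iteration(states, actions, P, R, gamma, eps=1e-4):
        """P[(s, a)] is a dict s_next -> probability, R[s] the reward, gamma the discount."""
        U = {s: 0.0 for s in states}
        while True:
            U_new, delta = {}, 0.0
            for s in states:
                best = max(sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in actions[s])
                U_new[s] = R[s] + gamma * best
                delta = max(delta, abs(U_new[s] - U[s]))
            U = U_new
            if delta < eps * (1 - gamma) / gamma:
                return U

    # Hypothetical two-state MDP: "stay" keeps the state, "go" moves to the other state.
    states = ["a", "b"]
    actions = {s: ["stay", "go"] for s in states}
    P = {("a", "stay"): {"a": 1.0}, ("a", "go"): {"b": 1.0},
         ("b", "stay"): {"b": 1.0}, ("b", "go"): {"a": 1.0}}
    R = {"a": 0.0, "b": 1.0}
    print(value_iteration(states, actions, P, R, gamma=0.9))   # U(b) ≈ 10, U(a) ≈ 9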
Policy Iteration
Recap: Value iteration computes utilities ⇝ optimal policy by MEU.
This even works if the utility estimate is inaccurate. (⇝ policy loss small)
Idea: Search for the optimal policy and utility values simultaneously [How60]: Iterate
policy evaluation: given policy πi, calculate Ui = U^{πi}, the utility of each state
were πi to be executed.
policy improvement: calculate a new MEU policy πi+1 using one-step lookahead.
Terminate if policy improvement yields no change in utilities.
Observation 27.3.8. Upon termination Ui is a fixpoint of Bellman update
; Solution to Bellman equation ; πi is an optimal policy.
Observation 27.3.9. Policy improvement improves policy and policy space is finite
; termination.
Policy Evaluation
Problem: How to implement the POLICY−EVALUATION algorithm?
Solution: To compute utilities given a fixed π: For all s we have

U(s) = R(s) + γ · (∑_{s′} U(s′) · P(s′|s, π(s)))
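Since π is fixed, these are n linear equations in the n unknowns U(s); a small sketch solving them directly (illustrative, reusing the two-state MDP from the value iteration sketch):

    import numpy as np

    states = ["a", "b"]
    P = {("a", "go"): {"b": 1.0}, ("b", "stay"): {"b": 1.0}}
    R = {"a": 0.0, "b": 1.0}
    policy = {"a": "go", "b": "stay"}
    gamma = 0.9

    # Solve (I - gamma * P_pi) U = R for U.
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    for s in states:
        for s2, prob in P[(s, policy[s])].items():
            A[idx[s], idx[s2]] -= gamma * prob
    U = np.linalg.solve(A, np.array([R[s] for s in states]))
    print(dict(zip(states, U)))   # -> {'a': 9.0, 'b': 10.0}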
Partial Observability
Definition 27.4.1. A partially observable MDP (a POMDP for short) is an MDP
together with a sensor model O that has the sensor Markov property and is stationary:
O(s, e) = P(e|s).
Problem: The agent does not know which state it is in ⇝ it makes no sense to talk
about a policy π(s)!
Theorem 27.4.3 (Astrom 1965). The optimal policy in a POMDP is a function
π(b) where b is the belief state (probability distribution over states).
Idea: Convert a POMDP into an MDP in belief state space, where T (b, a, b′ ) is
the probability that the new belief state is b′ given that the current belief state is b
and the agent does a. I.e., essentially a filtering update step.
For POMDPs, we also need to consider actions. (but the effect is the same)
If b(s) is the previous belief state and the agent does action a and then perceives e,
then the new belief state is

b′(s′) = α · P(e|s′) · (∑_{s} P(s′|s, a) · b(s))
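A direct transcription of this update (illustrative sketch; the transition and sensor models are passed in as dictionaries). This is exactly the FORWARD step used in the POMDP decision cycle below:

    def belief_update(b, a, e, states, P_trans, P_sense):
        """b: dict s -> belief; P_trans[(s, a)]: dict s' -> P(s'|s,a); P_sense[(e, s)]: P(e|s)."""
        b_new = {}
        for s_next in states:
            pred = sum(P_trans[(s, a)].get(s_next, 0.0) * b[s] for s in states)
            b_new[s_next] = P_sense[(e, s_next)] * pred
        z = sum(b_new.values())
        return {s: v / z for s, v in b_new.items()}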
Consequence: The optimal policy can be written as a function π ∗ (b) from belief
states to actions.
Definition 27.4.4. The POMDP decision cycle is to iterate over
1. Given the current belief state b, execute the action a = π ∗ (b)
2. Receive percept e.
3. Set the current belief state to FORWARD(b, a, e) and repeat.
Intuition: POMDP decision cycle is search in belief state space.
Example 27.4.6. The belief states of the 4×3 world form an 11-dimensional continuous
space. (11 states)
Theorem 27.4.7. Solving POMDPs is very hard! (actually, PSPACE hard)
Variables with known values are gray, the agent must choose a value for At.
Rewards for t = 0, . . ., t + 2, but utility for t + 3 (≙ discounted sum of the rest)
Part of the lookahead solution of the DDN above (search over the action tree):
circle ≙ chance nodes (the environment decides); triangle ≙ belief state (each
action decision is taken there)
Summary
Decision theoretic agents for sequential environments
Building on temporal, probabilistic models/inference (dynamic Bayesian networks)
The world is a POMDP with (initially) unknown transition and sensor models.
Machine Learning

This part introduces the foundations of machine learning methods in AI. We discuss the problem
of learning from observations in general, study inference-based techniques, and then go into
elementary statistical methods for learning.
Chapter 28: Learning from Observations
A Video Nugget covering the introduction to this chapter can be found at https://fau.tv/
clip/id/30369.
Outline
Learning agents
Inductive learning
Decision tree learning
Learning Element
Observation: The design of the learning element is dictated by
Preview:
Supervised learning: correct answers for each instance
Reinforcement learning: occasional rewards
Note:
1. Learning transition models is “supervised” if observable.
2. Supervised learning of correct actions requires “teacher”.
3. Reinforcement learning is harder, but requires no teacher.
Attribute-based Representations
Definition 28.3.1. In attribute based representations, examples are described by
attributes: (simple) functions on input samples, (think: pre-classifiers on examples)
their values, and (classify by attributes)
classifications. (Boolean, discrete, continuous, etc.)
Example 28.3.2 (In a Restaurant). Situations where I will/won’t wait for a table:
Attributes Target
Example Alt Bar F ri Hun P at P rice Rain Res T ype Est WillWait
X1 T F F T Some $$$ F T French 0–10 T
X2 T F F T Full $ F F Thai 30–60 F
X3 F T F F Some $ F F Burger 0–10 T
X4 T F T T Full $ F F Thai 10–30 T
X5 T F T F Full $$$ F T French >60 F
X6 F T F T Some $$ T T Italian 0–10 T
X7 F T F F None $ T F Burger 0–10 F
X8 F F F T Some $$ T T Thai 0–10 T
X9 F T T F Full $ T F Burger >60 F
X 10 T T T T Full $$$ F T Italian 10–30 F
X 11 F F F F None $ F F Thai 0–10 F
X 12 T T T T Full $ F F Burger 30–60 T
Decision Trees
Decision trees are one possible representation for hypotheses.
We evaluate the tree by going down the tree from the top, and always take the branch whose
attribute matches the situation; we will eventually end up with a Boolean value; the result. Using
the attribute values from X3 in Example 28.3.2 to descend through the tree in Example 28.3.4 we
indeed end up with the result “true”. Note that
1. some of the attributes in X3 are irrelevant.
2. the training set in Example 28.3.2 is realizable – i.e. the target is definable in the hypothesis class
of decision trees.
Expressiveness
Decision trees can express any function of the input attributes.
Example 28.3.6. For Boolean functions, truth table row ⇝ path to leaf:
Hypothesis Spaces
Question: How many distinct decision trees are there with n Boolean attributes?
Answer: reserved for the plenary sessions ; be there!
Question: How many purely conjunctive hypotheses? (e.g., Hungry ∧ ¬Rain)
Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) “all
positive” or “all negative”.
Example 28.3.8.
Attribute "Patrons?" is a better choice: it gives information about the classification.
Can we make this more formal? ; use information theory! (up next)
Information Entropy
Scale: 1 b ≙ 1 bit ≙ answer to a Boolean question with prior probability ⟨0.5, 0.5⟩.
Definition 28.4.1.
If the prior probability is ⟨P1, . . ., Pn⟩, then the information in an answer (also called
entropy of the prior) is

I(⟨P1, . . ., Pn⟩) := ∑_{i=1}^{n} −Pi · log₂(Pi)
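As an illustrative sketch (not from the notes), entropy and the resulting information gain of an attribute split, computed for the "Patrons?" attribute of the restaurant data above (None: 0+/2−, Some: 4+/0−, Full: 2+/4−):

    from math import log2

    def entropy(probs):
        return sum(-p * log2(p) for p in probs if p > 0)

    def gain(pos, neg, subsets):
        """subsets: list of (pos_k, neg_k) counts after splitting on the attribute."""
        before = entropy([pos / (pos + neg), neg / (pos + neg)])
        remainder = sum((p + n) / (pos + neg) * entropy([p / (p + n), n / (p + n)])
                        for p, n in subsets)
        return before - remainder

    # Restaurant data: 6 positive, 6 negative examples.
    print(round(gain(6, 6, [(0, 2), (4, 0), (2, 4)]), 3))   # -> 0.541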
Result: Substantially simpler than “true” tree – a more complex hypothesis isn’t
justified by small amount of data.
Performance measurement
Question: How do we know that h ≈ f ? (Hume’s Problem of Induction)
1. Use theorems of computational/statistical learning theory.
2. Try h on a new test set of examples. (use same distribution over example space
as training set)
Definition 28.4.10. The learning curve ≙ percentage correct on the test set as a
function of training set size.
Example 28.4.11. Restaurant data; graph averaged over 20 trials
For an attribute A with d values, compare the actual numbers pk and nk in each
subset sk with the expected numbers (expected if A is irrelevant)

p̂k = p · (pk + nk)/(p + n)   and   n̂k = n · (pk + nk)/(p + n).

Δ = ∑_{k=1}^{d} ((pk − p̂k)²/p̂k + (nk − n̂k)²/n̂k)
Idea: We think of examples (seen and unseen) as a sequence, and express the
“representativeness” as a stationarity assumption for the probability distribution.
Method: Each example before we see it is a random variable Ej , the observed
value ej = (xj ,yj ) samples its distribution.
Definition 28.5.1.
A sequence E1, . . ., En of random variables is independent and identically dis-
tributed (short: IID), iff they are
independent, i.e. P(Ej|Ej−1, Ej−2, . . .) = P(Ej) and
identically distributed, i.e. P(Ei) = P(Ej) for all i and j.
Example 28.5.2. A sequence of die tosses is IID. (fair or loaded does not matter)
Stationarity Assumption: We assume that the set E of examples is IID in the
future.
Definition 28.5.3. Given an inductive learning problem ⟨H, f ⟩, we define the error
rate of a hypothesis h∈H as the fraction of errors:
Caveat: A low error rate on the training set does not mean that a hypothesis
generalizes well.
Idea: Do not use homework questions in the exam.
Definition 28.5.4. The practice of splitting the data available for learning into
1. a training set from which the learning algorithm produces a hypothesis h and
2. a test set, which is used for evaluating h
is called holdout cross validation. (no peeking at test set allowed)
Model Selection
Definition 28.5.7. The model selection problem is to determine – given data – a
good hypothesis space.
Example 28.5.8. What is the best polynomial degree to fit the data
Idea: Solve the two parts together by iteration over “size”. (they inform each
other)
Problem: Need a notion of “size” ⇝ e.g. number of nodes in a decision tree.
Concrete Problem: Find the “size” that best balances overfitting and underfitting
to optimize test set accuracy.
[figure: validation set error and training set error as a function of tree size]
Stop when the training set error rate converges; choose the tree size that is optimal on the
validation curve. (here a tree with 7 nodes)
Generalization Loss
Note: L(y, y) = 0. (no loss if you are exactly correct)
Definition 28.5.14 (Popular general loss functions).
absolute value loss: L₁(y, ŷ) := |y − ŷ|               (small errors are good)
squared error loss:  L₂(y, ŷ) := (y − ŷ)²              (dito)
0/1 loss:            L₀/₁(y, ŷ) := 0 if y = ŷ, else 1  (error rate)
Idea: Maximize expected utility by choosing the hypothesis h that minimizes the expected
loss over all (x,y) ∈ f.
Definition 28.5.15. Let E be the set of all possible examples and P(X, Y) the
prior probability distribution over its components, then the expected generalization
loss for a hypothesis h with respect to a loss function L is

GenLoss_L(h) := ∑_{(x,y)∈E} L(y, h(x)) · P(x, y)
Empirical Loss
Problem: P(X, Y ) is unknown ; learner can only estimate generalization loss:
Regularization
Idea: Directly use empirical loss to solve model selection. (finding a good H)
Minimize the weighted sum of empirical loss and hypothesis complexity. (to avoid
overfitting).
Definition 28.5.17. Let λ∈R, h∈H, and E a set of examples, then we call
This works well in the limit, but for smaller problems there is a difficulty in that the
choice of encoding for the program affects the outcome.
e.g., how best to encode a decision tree as a bit string?
In recent years there has been more emphasis on large-scale learning. (millions of
examples)
PAC Learning
Basic idea of Computational Learning Theory:
Any hypothesis h that is seriously wrong will almost certainly be "found out"
with high probability after a small number of examples, because it will make an
incorrect prediction.
Thus, any h that is consistent with a sufficiently large set of training examples is unlikely
to be seriously wrong.
⇝ h is probably approximately correct.
Definition 28.6.1. Any learning algorithm that returns hypotheses that are prob-
ably approximately correct is called a PAC learning algorithm.
Derive performance bounds for PAC learning algorithms in general, using the
PAC Learning
Start with PAC theorems for Boolean functions, for which L0/1 is appropriate.
Definition 28.6.2. The error rate error(h) of a hypothesis h is the probability that
h misclassifies a new example.
error(h) := GenLoss_{L₀/₁}(h) = ∑_{(x,y)∈E} L₀/₁(y, h(x)) · P(x, y)
Sample Complexity
Let's compute the probability that hb ∈ Hb is consistent with the first N examples.
We know error(hb) > ε.
⇝ P(hb agrees with N examples) ≤ (1 − ε)^N. (independence)
⇝ P(Hb contains a consistent hypothesis) ≤ #(Hb) · (1 − ε)^N ≤ #(H) · (1 − ε)^N. (Hb ⊆ H)
⇝ To bound this by a small δ, show the algorithm N ≥ (1/ε) · (log₂(1/δ) + log₂(#(H))) examples.
Definition 28.6.4. The number of required examples as a function of ε and δ is
called the sample complexity of H.
Example 28.6.5. If H is the set of n-ary Boolean functions, then #(H) = 2^(2ⁿ).
⇝ the sample complexity grows with O(log₂(2^(2ⁿ))) = O(2ⁿ).
There are 2ⁿ possible examples,
⇝ PAC learning for Boolean functions needs to see (nearly) all examples.
[decision list: if Patrons(x, Some) then Yes; else if Patrons(x, Full) ∧ Fri/Sat(x) then Yes; else No]
Lemma 28.6.8. Given arbitrary size conditions, decision lists can represent arbi-
trary Boolean functions. (equivalent to
CNF)
This directly defeats our purpose of finding a “learnable subset” of H.
Plug this into the equation for the sample complexity: N ≥ (1/ε) · (log₂(1/δ) + log₂(|H|))
to obtain

N ≥ (1/ε) · (log₂(1/δ) + log₂(O(nᵏ · log₂(nᵏ))))
Intuitively: Any algorithm that returns a consistent decision list will PAC learn a
k-DL function in a reasonable number of examples, for small k.
1. find a test that agrees exactly with some subset E of the training set,
2. add it to the decision list under construction and remove E,
3. construct the remainder of the DL using just the remaining examples,
until there are no examples left.
[figure: learning curves (proportion correct on the test set vs. training set size) for decision tree and decision list learning on the restaurant data]
Example 28.7.6. [figure: examples of house price (in $1000) vs. house size, with a fitted linear hypothesis]
Idea: Minimize squared error loss over {(xi,yi) | i ≤ N} (used already by Gauss)

Loss(hw) = ∑_{j=1}^{N} L₂(yj, hw(xj)) = ∑_{j=1}^{N} (yj − hw(xj))² = ∑_{j=1}^{N} (yj − (w1xj + w0))²
Remark: Closed-form solutions only exist for linear regression, for other (dif-
ferentiable) hypothesis spaces use gradient descent methods for adjusting/learning
weights.
Note: it is convex. [figure: the loss surface over (w0, w1)]
Observation 28.7.8. The squared error loss function is convex for any linear
regression problem ⇝ there are no local minima.
The parameter α is called the learning rate. It can be a fixed constant or it can
decay as learning proceeds.
Definition 28.7.10.

w0 ← w0 − α · (∑ⱼ −2 · (yj − hw(xj)))        w1 ← w1 − α · (∑ⱼ −2 · (yj − hw(xj)) · xj)
These updates constitute the batch gradient descent learning rule for univariate
linear regression.
Convergence to the unique global loss minimum is guaranteed (as long as we pick
α small enough) but may be very slow.
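An illustrative sketch of this batch rule (not the notes' code; the data is made up, and the factor −2 is folded into the learning rate, as is common):

    def fit_univariate(xs, ys, alpha=0.01, epochs=5000):
        """Batch gradient descent for h_w(x) = w1*x + w0 under squared error loss."""
        w0, w1 = 0.0, 0.0
        n = len(xs)
        for _ in range(epochs):
            errs = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
            w0 += alpha * sum(errs) / n
            w1 += alpha * sum(e * x for e, x in zip(errs, xs)) / n
        return w0, w1

    # Made-up data lying exactly on y = 2x + 1.
    xs, ys = [0, 1, 2, 3, 4], [1, 3, 5, 7, 9]
    print(fit_univariate(xs, ys))   # -> approximately (1.0, 2.0)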
Gradient descent will reach the (unique) minimum of the loss function; the update
equation for each weight wi is

wi ← wi + α · (∑ⱼ xj,i · (yj − hw(x⃗j)))
[figure: seismic data in the (x1, x2) plane – white: earthquakes, black: underground explosions]
Also: h_{w*} as a decision boundary: x2 = 1.7·x1 − 4.9.
wi ←− wi + α · (y − hw (x)) · xi
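An illustrative perceptron learning sketch using this update rule with a hard threshold (the data below is made up and linearly separable):

    def perceptron_learn(data, alpha=0.1, epochs=100):
        """data: list of (x_vector, y) with y in {0, 1}; returns weights [w0, w1, ...]."""
        n = len(data[0][0])
        w = [0.0] * (n + 1)                       # w[0] is the bias weight (x_0 = 1)
        for _ in range(epochs):
            for x, y in data:
                xs = [1.0] + list(x)
                h = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
                w = [wi + alpha * (y - h) * xi for wi, xi in zip(w, xs)]
        return w

    # Made-up separable data: class 1 iff x1 + x2 > 1.
    data = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1), ((2, 1), 1), ((1, 2), 1)]
    print(perceptron_learn(data))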
Example 28.7.18. [figure: learning curves (plots of total training set accuracy vs. number of weight updates) for the perceptron rule on the earthquake/explosion data]
Logistic Regression
∂/∂wi (L₂(w)) = ∂/∂wi (y − hw(x))²
             = 2 · (y − hw(x)) · ∂/∂wi (y − hw(x))
             = −2 · (y − hw(x)) · l′(w·x) · ∂/∂wi (w·x)
             = −2 · (y − hw(x)) · l′(w·x) · xi
Definition 28.7.22. The rule for logistic update (weight update for minimizing the
loss) is
wi ←− wi + α · (y − hw (x)) · hw (x) · (1 − hw (x)) · xi
Outline
Brains
Neural networks
Perceptrons
Multilayer perceptrons
Applications of neural networks
Brains
Axiom 28.8.1 (Neuroscience Hypothesis). Mental activity consists primarily of
electrochemical activity in networks of brain cells called neurons.
One approach to Artificial Intelligence is to model and simulate brains. (and hope
that AI comes along naturally)
Definition 28.8.3. The AI sub field of neural networks (also called connectionism,
parallel distributed processing, and neural computation) studies computing systems
inspired by the biological neural networks that constitute brains.
Neural networks are attractive computational devices, since they perform important
AI tasks – most importantly learning and distributed, noise-tolerant computation –
naturally and efficiently.
ai ← g(ini) = g(∑ⱼ wj,i · aj)
[figure: a unit with input links, input function, activation function, output, and output links]
Theorem 28.8.6 (McCulloch and Pitts). Every Boolean function can be imple-
mented as McCulloch Pitts networks.
Proof: by construction
1. Recall that ai ← g(∑ⱼ wj,i · aj).
2. As for linear regression we use a0 = 1 ⇝ w0,i as a bias weight (or intercept). (determines the threshold)
3. [figure: threshold units implementing elementary Boolean functions]
4. Any Boolean function can be implemented as a DAG of McCulloch Pitts units.
Recurrent neural networks have directed cycles with delays ; have internal state
(like flip-flops), can oscillate etc.
Single-layer Perceptrons
[figure: a single-layer perceptron network (input layer, weights wi,j, output layer) and the output surface of a two-input perceptron unit]
Output units all operate separately, no shared weights ; treat as the combination
of n perceptron units.
Adjusting weights moves the location, orientation, and steepness of cliff.
a5 = g(w3,5 · a3 + w4,5 · a4 )
= g(w3,5 · g(w1,3 · a1 + w2,3 a2 ) + w4,5 · g(w1,4 · a1 + w2,4 a2 ))
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc., but not XOR (and thus no adders)
Represents a linear separator in input space:

∑ⱼ wj·xj > 0   or   w·x > 0
[figure: linear separability in input space – (a) x1 and x2, (b) x1 or x2 are separable; (c) x1 xor x2 is not]
Minsky & Papert (1969) pricked the first neural network balloon!
Perceptron Learning
Idea: Wlog. treat only single-output perceptrons ; w is a “weight vector”.
Learn by adjusting weights in w to reduce generalization loss on training set.
Let us compute with the squared error loss of a weight vector w for an example (x,y):

Loss(w) = Err² = (y − hw(x))²

and hence ∂Loss(w)/∂wj = −2 · Err · g′(inj) · xj.
Multilayer perceptrons
Definition 28.8.13. In multilayer perceptrons (MLPs), layers are usually fully
connected; the number of hidden units is typically chosen by hand.
[figure: input layer (activations ak), hidden layer (aj), and output layer (ai), connected by weights wk,j and wj,i]
Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers.
Observation: The squared error loss of a weight matrix w for an example (x,y) is

Loss(w) = ‖y − hw(x)‖₂² = ∑_{k=1}^{n} (yk − ak)²
Output layer: Analogous to that for single-layer perceptron, but multiple output
units
wj,i ← wj,i + α · aj · ∆i
where ∆i = Erri · g ′ (ini ) and Err = y − hw (x). (error vector)
Definition 28.8.14. The back propagation rule for hidden nodes of a multilayer
perceptron is
∆j ← g′(inj) · (∑i wj,i · ∆i)
wk,j ← wk,j + α · ak · ∆j
Back-Propagation Process
The back-propagation process can be summarized as follows:
1. Compute the ∆ values for the output units, using the observed error.
2. Starting with output layer, repeat the following for each layer in the network, until
the earliest hidden layer is reached:
(a) Propagate the ∆ values back to the previous (hidden) layer.
(b) Update the weights between the two layers.
Details (algorithm) later.
Compute the loss gradient w.r.t. the weights between the output and hidden layers; its negative gives the update wj,i ← wj,i + α · aj · ∆i from above.
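The following is a minimal sketch (not the algorithm from the course materials, just an illustration) of one such update step in Python for a single-hidden-layer network with sigmoid units; the ∆-notation follows the definitions above, everything else (sizes, data layout, learning rate) is assumed:

import math

def g(z):  return 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
def gp(z): return g(z) * (1.0 - g(z))          # its derivative g'

def backprop_step(W_kj, W_ji, x, y, alpha=0.5):
    """One update for a 1-hidden-layer MLP (weights are updated in place).
       W_kj[j][k]: input->hidden weights, W_ji[i][j]: hidden->output weights."""
    # forward pass
    in_j = [sum(W_kj[j][k] * x[k] for k in range(len(x))) for j in range(len(W_kj))]
    a_j  = [g(z) for z in in_j]
    in_i = [sum(W_ji[i][j] * a_j[j] for j in range(len(a_j))) for i in range(len(W_ji))]
    a_i  = [g(z) for z in in_i]
    # output layer deltas: Delta_i = Err_i * g'(in_i)
    delta_i = [(y[i] - a_i[i]) * gp(in_i[i]) for i in range(len(a_i))]
    # hidden layer deltas: Delta_j = g'(in_j) * sum_i w_{j,i} Delta_i
    delta_j = [gp(in_j[j]) * sum(W_ji[i][j] * delta_i[i] for i in range(len(delta_i)))
               for j in range(len(a_j))]
    # weight updates: w_{j,i} += alpha*a_j*Delta_i  and  w_{k,j} += alpha*x_k*Delta_j
    for i in range(len(W_ji)):
        for j in range(len(a_j)):
            W_ji[i][j] += alpha * a_j[j] * delta_i[i]
    for j in range(len(W_kj)):
        for k in range(len(x)):
            W_kj[j][k] += alpha * x[k] * delta_j[j]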
Back-Propagation – Properties
At each epoch, sum gradient updates for all examples and apply.
Training curve for 100 restaurant examples: finds exact fit.
[Plot: total error on the training set vs. number of epochs]
[Plot: proportion correct on test set vs. training set size]
Experience shows: MLPs are quite good for complex pattern recognition tasks,
but resulting hypotheses cannot be understood easily.
This makes MLPs ineligible for some tasks, such as credit card and loan approvals,
where law requires clear unbiased criteria.
Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)
Idea: Use gradient descent to search the space of all w and b for maximizing
combinations. (works, but SVMs follow a different route)
3. The weights αj associated with each data point are zero except at the support
vectors – the points closest to the separator.
There are good software packages for solving such quadratic programming
optimizations.
Once we have found an optimal vector α, use w = ∑j αj · xj.
[Figure: left: a two-dimensional input space (x1, x2); right: the same data mapped into the three-dimensional input space ⟨x1², x2², √2·x1·x2⟩ ; separable by a hyperplane]
Upshot: We map each input vector x to F(x) with f1 = x1², f2 = x2², and
f3 = √2·x1·x2.
Idea: Replace xj·xk by F(xj)·F(xk) in the SVM equation. (compute in the high-dimensional space)
Often we can compute F(xj)·F(xk) without computing F everywhere.
Example 28.9.5. If F(x) = ⟨x1², x2², √2·x1·x2⟩, then F(xj)·F(xk) = (xj·xk)²
(we have added the √2 in F so that this works)
We call the function (xj·xk)² a kernel function. (there are others; next)
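A two-line numerical check of Example 28.9.5 (the concrete points are arbitrary, chosen only for illustration):

import math

def F(x):                       # explicit feature map <x1^2, x2^2, sqrt(2)*x1*x2>
    return (x[0]**2, x[1]**2, math.sqrt(2) * x[0] * x[1])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
print(dot(F(x), F(z)))          # 1.0  (dot product in feature space)
print(dot(x, z) ** 2)           # 1.0  (kernel computed in the original space)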
For supervised learning, the aim is to find a simple hypothesis that is approximately
consistent with training examples
Decision tree learning using information gain.
Learning performance = prediction accuracy measured on test set
Statistical Learning
What kind of bag is it? What flavour will the next candy be?
[Plot: the posterior probabilities P(h1 | d), . . ., P(h5 | d) as a function of the number of observations in d]
if the observations are IID, i.e. P(d|hi) = ∏j P(dj|hi), and the hypothesis prior is
as advertised. (e.g. P(d|h3) = 0.5^10 ≈ 0.1%)
The posterior probabilities start with the hypothesis priors, change with data.
[Plot: the probability that the next candy is lime, as a function of the number of observations in d]
The posterior is P(hi|d) ∝ P(d|hi) · P(hi), where P(d|hi) is called the likelihood (of
the data under each hypothesis) and P(hi) the hypothesis prior.
Bayesian predictions use a likelihood-weighted average over the hypotheses:
P(X|d) = ∑i P(X|d, hi) · P(hi|d) = ∑i P(X|hi) · P(hi|d)
Definition 29.2.1. For maximum a posteriori learning (MAP learning) choose the
MAP hypothesis hMAP that maximizes P (hi |d).
I.e., maximize P (d|hi ) · P (hi ) or (even better) log2 (P (d|hi )) + log2 (P (hi )).
Predictions made according to a MAP hypothesis hMAP are approximately Bayesian
to the extent that P(X|d) ≈ P(X|hMAP ).
Example 29.2.2. In our candy example, hMAP = h5 after three limes in a row;
a MAP learner then predicts that candy 4 is lime with probability 1.
Compare this with the Bayesian prediction of 0.8. (see prediction curves above)
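These numbers can be reproduced directly; the sketch below assumes the usual candy-example priors ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩ and lime fractions ⟨0, 0.25, 0.5, 0.75, 1⟩ (an assumption, since the concrete values are not repeated in this extract):

priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # assumed P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # assumed P(lime | h_i)

def posterior(n_limes):
    """P(h_i | d) after observing n_limes limes in a row (IID observations)."""
    unnorm = [p * (l ** n_limes) for p, l in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

post = posterior(3)
print(max(range(5), key=lambda i: post[i]) + 1)     # 5   -> h_MAP = h5
print(sum(l * p for l, p in zip(p_lime, post)))     # ~0.80, the Bayesian prediction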
As more data arrive, the MAP and Bayesian predictions become closer, because the
competitors to the MAP hypothesis become less and less probable.
For deterministic hypotheses, P (d|hi ) is 1 if consistent, 0 otherwise
; MAP = simplest consistent hypothesis. (cf. science)
Remark: Finding MAP hypotheses is often much easier than Bayesian learning,
because it requires solving an optimization problem instead of a large summation
(or integration) problem.
Indeed, if a hypothesis predicts the data exactly – e.g. h5 in the candy example – then
log2(P(d|hi)) = log2(1) = 0 ; it is the preferred hypothesis.
This is more directly modeled by the following approximation to Bayesian learning:
Observation: For large data sets, the prior becomes irrelevant. (we might not
trust it anyways)
Idea: Use this to simplify learning.
Definition 29.2.4. Maximum likelihood learning (ML learning): choose the ML
hypothesis hML maximizing P (d|hi ). (simply get the best fit to the data)
Remark: ML learning = b MAP learning for a uniform prior. (reasonable if all
hypotheses are of the same complexity)
ML learning is the “standard” (non Bayesian) statistical learning method.
[Bayesian network: a single node Flavor with the parameter P(F = cherry) = θ]
Trick: When optimizing a product, optimize the logarithm instead! (log2 (!) is
monotone and turns products into sums)
Definition 29.3.3. The log likelihood is just the binary logarithm of the likelihood.
L(d|h):=log2 (P (d|h))
In English: hθ asserts that the actual proportion of cherries in the bag is equal to
the observed proportion in the candies unwrapped so far!
Seems sensible, but causes problems with 0 counts!
1. Write down an expression for the likelihood of the data as a function of the
parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero
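For the single-parameter candy bag this recipe can be carried out explicitly (writing c and ℓ for the observed numbers of cherry and lime candies; a standard calculation, included only to make the three steps concrete):

L(d|hθ) = log2(P(d|hθ)) = c · log2(θ) + ℓ · log2(1 − θ)
∂/∂θ L(d|hθ) = c/(θ · ln 2) − ℓ/((1 − θ) · ln 2) = 0  ;  θ = c/(c + ℓ)

i.e. the ML hypothesis asserts that the proportion of cherries in the bag equals the observed proportion – exactly the statement above.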
[Bayesian network: Flavor → Wrapper, with parameters P(F = cherry) = θ, P(W = red|F = cherry) = θ1, and P(W = red|F = lime) = θ2]
[Plot: the conditional density P(y|x) plotted over x and y]
Note: This kind of model is called “naive” since it is often used as a simplifying
model even if the effects are not conditionally independent after all.
[Plot: proportion correct on test set vs. training set size for decision tree learning and naive Bayes learning]
Naive Bayes learning scales well: with n Boolean attributes, there are just
2n + 1 parameters, and no search is required to find hML .
Naive Bayes learning systems have no difficulty with noisy or missing data and can
give probabilistic predictions when appropriate.
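To make the counting concrete, here is a minimal naive Bayes sketch in Python (the toy data, the add-one smoothing, and all names are illustrative assumptions, not taken from the course materials):

from collections import Counter

def train_naive_bayes(examples):
    """examples: list of (attributes, cls) with Boolean attributes.
       Returns P(cls) and P(attr_i = 1 | cls), estimated by counting
       with add-one smoothing (one common fix for 0 counts)."""
    classes = Counter(cls for _, cls in examples)
    n_attrs = len(examples[0][0])
    cond = {c: [1] * n_attrs for c in classes}          # add-one counts
    for attrs, cls in examples:
        for i, v in enumerate(attrs):
            cond[cls][i] += v
    prior = {c: classes[c] / len(examples) for c in classes}
    likeli = {c: [cond[c][i] / (classes[c] + 2) for i in range(n_attrs)] for c in classes}
    return prior, likeli

def predict(prior, likeli, attrs):
    def score(c):
        p = prior[c]
        for i, v in enumerate(attrs):
            p *= likeli[c][i] if v else 1 - likeli[c][i]
        return p
    return max(prior, key=score)

# toy usage
data = [([1, 1, 0], "yes"), ([1, 0, 0], "yes"), ([0, 1, 1], "no"), ([0, 0, 1], "no")]
prior, likeli = train_naive_bayes(data)
print(predict(prior, likeli, [1, 1, 0]))   # 'yes'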
Maximum likelihood learning assumes uniform prior, OK for large data sets:
1. Choose a parameterized family of models to describe the data.
; requires substantial insight and sometimes new models.
2. Write down the likelihood of the data as a function of the parameters.
; may require summing over hidden variables, i.e., inference.
3. Write down the derivative of the log likelihood w.r.t. each parameter.
4. Find the parameter values such that the derivatives are zero.
; may be hard/impossible; modern optimization techniques help.
Knowledge in Learning
Definition 30.1.3. Logic based inductive learning tries to learn a hypothesis h that
explains the classifications of the examples given their descriptions, i.e. h, D |= C
(the explanation constraint), where D is the conjunction of the example descriptions
and C the conjunction of the corresponding classifications.
For instance, the first example X1 from Example 28.3.2 can be described as
The classification is given by the goal predicate WillWait, in this case WillWait(X1)
or ¬WillWait(X1).
can be represented as
Method: Construct a disjunction of all the paths from the root to the positive
leaves interpreted as conjunctions of the attributes on the path.
Note: The equivalence takes care of positive and negative examples.
Cumulative Development
Example 30.1.6. Learning from very few examples using background knowledge:
1. Caveman Zog and the fish on a stick:
[Diagram: observations and prior knowledge feed into logic based inductive learning, which produces hypotheses and predictions]
Explanation-based Learning
Idea: Use explanation of success to infer a general rule.
Definition 30.1.9. Explanation based learning (EBL) refines the explanation con-
straint to the EBL constraints:
Relevance-based Learning
Idea: Use the prior knowledge to determine the relevance of a set of features to
the goal predicate. (reduce the hypothesis space to the relevant ones)
Example 30.1.10. In a given country most people speak the same language, but
do not have the same name:
Definition 30.1.11. Relevance based learning (RBL) refines the explanation con-
straint to the RBL constraints:
Deductive Learning
Definition 30.1.12. We call a procedure a deductive learning algorithm, if it makes
use of the observations, but does not produce hypotheses beyond the background
knowledge and the observations.
Definition 30.1.21. Knowledge based inductive learning (KBIL) replaces the ex-
planation constraint by the KBIL constraint:
Explanation-Based Learning
Intuition: EBL =
b Extracting general rules from individual observations.
Example 30.2.1. Differentiating and simplifying algebraic expressions
Rewr(1 × u, u)
Rewr(0 + u, u)
...
Simpl(1 × (0 + X), w)
ArithVar(X)
yes
ArithVar(z)
yes
Example 30.2.4. Take the leaves of the generalized proof tree to get the general
rule
Recap:
Use background knowledge to construct a proof for the example.
In parallel, construct a generalized proof tree.
New rule is the conjunction of the leaves of the proof tree and the variabilized
goal.
Drop conditions that are true regardless of the variables in the goal.
Example 30.2.5.
prim(z) ⇒ Simpl(1 × (0 + z), z)
Simpl(y + z, w) ⇒ Simpl(1 × (y + z), w)
Adding large numbers of rules to the knowledge base slows down the reasoning
process (increases the branching factor of the search space).
To compensate, the derived rules must offer significant speed increases.
Derived rules should be as general as possible to apply to the largest possible
set of cases.
Deductive learning: Makes use of the observations, but does not produce hypotheses
beyond the background knowledge and the observations.
So the observed example, together with the determination, entails
∀x Nationality(x, Brazil) ⇒ Language(x, Portuguese)
Special syntax: Nationality(x, n)≻Language(x, l)
Intuition: If we know the values of P and Q for one example x, e.g., P(x, a) and
Q(x, b), we can use the determination P≻Q to infer ∀y P(y, a) ⇒ Q(y, b).
Deriving Hypotheses
Given an algorithm for learning determinations, a learning agent has a way to
construct a minimal hypothesis within which to learn the target predicate.
Idea: Use decision tree learning for computing hypotheses.
Goal: Minimize size of hypotheses.
Result: Relevance based decision tree learning.
Exploiting Knowledge
RBDTL simultaneously learns and uses relevance information to minimize its hy-
pothesis space.
Declarative bias
How can prior knowledge be used to identify the appropriate hypothesis space
to search for the correct target definition?
Unanswered questions: (engineering needed to make ideas practical)
[Plot: proportion correct on test set vs. training set size for RBDTL and DTL]
30.4.1 An Example
A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/30396.
ILP: An example
General knowledge-based induction problem
[Figure: a family tree (George, Mum, . . . ) providing the example data]
Background knowledge
Observation: A little bit of background knowledge helps a lot.
Example 30.4.3. If the background knowledge contains
FOIL
function Foil(examples,target) returns a set of Horn clauses
  inputs: examples, set of examples
          target, a literal for the goal predicate
  local variables: clauses, set of clauses, initially empty
  while examples contains positive examples do
    clause := New−Clause(examples,target)
    remove examples covered by clause from examples
    add clause to clauses
  return clauses
FOIL
function New−Clause(examples,target) returns a Horn clause
  local variables: clause, a clause with target as head and an empty body
                   l, a literal to be added to the clause
                   extendedExamples, a set of examples with values for new variables
  extendedExamples := examples
  while extendedExamples contains negative examples do
    l := Choose−Literal(New−Literals(clause),extendedExamples)
    append l to the body of clause
    extendedExamples := map Extend−Example over extendedExamples
  return clause

function Extend−Example(example,literal) returns a new example
  if example satisfies literal
  then return the set of examples created by extending example with each
       possible constant value for each new variable in literal
  else return the empty set

function New−Literals(clause) returns a set of possibly ‘‘useful’’ literals
function Choose−Literal(literals) returns the ‘‘best’’ literal from literals
father(x, z) ⇒ grandfather(x, y)
Add literals using predicates
Negated or unnegated
Use any existing predicate (including the goal)
Arguments must be variables
Each literal must include at least one variable from an earlier literal or from the
head of the clause
Valid: Mother(z, u), Married(z, z), grandfather(v, x)
Invalid: Married(u, v)
Inverse Resolution
Inverse resolution in a nutshell:
Classifications follow from Background ∧ Hypothesis ∧ Descriptions.
This can be proven by resolution.
Run the proof backwards to find the hypothesis.
Problem: How to run the proof backwards?
Recap: In ordinary resolution we take two clauses C1 = L ∨ R1 and C2 = ¬L ∨ R2
and resolve them to produce the resolvent C = R1 ∨ R2 .
[Figure: an inverse resolution tree with substitutions [George/x], [Elisabeth/y] and [Anne/y], ending in True ⇒ False]
Can inverse resolution infer the law of gravity from examples of falling bodies?
Yes, given suitable background mathematics!
Monkey and typewriter problem: How to overcome the large branching factor and
the lack of structure in the search space?
Inventing new predicates is important to reduce the size of the definition of the
goal predicate.
Some of the deepest revolutions in science come from the invention of new predicates.
(e.g. Galileo's invention of acceleration)
Applications of ILP
ILP systems have outperformed knowledge free methods in a number of domains.
Molecular biology: the GOLEM system has been able to generate high-quality
predictions of protein structures and the therapeutic efficacy of various drugs.
Reinforcement Learning
Unsupervised Learning
So far: we have studied “learning from examples”. (functions, logical theories,
probability models)
Now: How can agents learn “what to do” in the absence of labeled examples of
“what to do”? We call this problem unsupervised learning.
Example 31.1.1 (Playing Chess). Learn transition models for own moves and
maybe predict opponent’s moves.
Problem: The agent needs to have some feedback about what is good/bad
; cannot decide “what to do” otherwise. (recall: external performance standard
for learning agents)
Example 31.1.2. The ultimate feedback in chess is whether you win, lose, or draw.
Definition 31.1.3. We call a learning situation where there are no labeled examples
unsupervised learning and the feedback involved a reward or reinforcement.
Example 31.1.4. In soccer, there are intermediate reinforcements in the shape of
goals, penalties, . . .
In MDPs, the agent has total knowledge about the environment and the reward
function, in reinforcement learning we do not assume this. (; POMDPs+rewards)
Example 31.1.6. You play a game without knowing the rules, and at some time
the opponent shouts you lose!
Passive Learning
Definition 31.2.1 (To keep things simple). Agent uses a state-based represen-
tation in a fully observable environment:
[Figure: the optimal policy π and the resulting utilities of the states in the 4 × 3 world, calculated with γ = 1 and R(s) = −0.04 for nonterminal states]
The agent executes a set of trials in the environment using its policy π.
In each trial, the agent starts in state (1,1) and experiences a sequence of state
transitions until it reaches one of the terminal states, (4,2) or (4,3).
Its percepts supply both the current state and the reward received in that state.
Passive Learning by Example
Example 31.2.3. Typical trials might look like this:
1. (1,1)−0.04 ; (1,2)−0.04 ; (1,3)−0.04 ; (1,2)−0.04 ; (1,3)−0.04 ; (2,3)−0.04 ;
   (3,3)−0.04 ; (4,3)+1
2. (1,1)−0.04 ; (1,2)−0.04 ; (1,3)−0.04 ; (2,3)−0.04 ; (3,3)−0.04 ; (3,2)−0.04 ;
   (3,3)−0.04 ; (4,3)+1
3. (1,1)−0.04 ; (2,1)−0.04 ; (3,1)−0.04 ; (3,2)−0.04 ; (4,2)−1
Definition 31.2.4. The utility is defined to be the expected sum of (discounted)
rewards obtained if policy π is followed:
U^π(s) := E(∑_{t=0}^{∞} γ^t · R(S_t))
where R(s) is the reward for a state, S_t (a random variable) is the state reached at
time t when executing policy π, and S_0 = s. (for 4 × 3 we take the discount factor γ = 1)
A simple method for direct utility estimation was invented in the late 1950s in the
area of adaptive control theory.
Definition 31.2.5. The utility of a state is the expected total reward from that
state onward (called the expected reward to go).
Idea: Each trial provides a sample of the reward to go for each state visited.
Example 31.2.6. The first trial in Example 31.2.3 provides a sample total reward
of 0.72 for state (1,1), two samples of 0.76 and 0.84 for (1,2), two samples of 0.80
and 0.88 for (1,3), . . .
Definition 31.2.7. The direct utility estimation algorithm cycles over trials, cal-
culates the reward to go for each state, and updates the estimated utility for that
state by keeping a running average for each state in a table.
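A minimal sketch of this algorithm applied to the three trials of Example 31.2.3 (rewards −0.04 as in the 4 × 3 world; the code itself is illustrative only):

from collections import defaultdict

trials = [   # each entry: (state, reward received in that state)
    [((1,1),-0.04),((1,2),-0.04),((1,3),-0.04),((1,2),-0.04),((1,3),-0.04),
     ((2,3),-0.04),((3,3),-0.04),((4,3),1.0)],
    [((1,1),-0.04),((1,2),-0.04),((1,3),-0.04),((2,3),-0.04),((3,3),-0.04),
     ((3,2),-0.04),((3,3),-0.04),((4,3),1.0)],
    [((1,1),-0.04),((2,1),-0.04),((3,1),-0.04),((3,2),-0.04),((4,2),-1.0)],
]

sums, counts = defaultdict(float), defaultdict(int)
for trial in trials:
    for i, (state, _) in enumerate(trial):
        reward_to_go = sum(r for _, r in trial[i:])   # one sample of the reward to go
        sums[state] += reward_to_go
        counts[state] += 1

U = {s: sums[s] / counts[s] for s in sums}            # running averages
print(round(U[(1,1)], 3))   # (1,1) gets samples 0.72, 0.72, -1.16 -> about 0.093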
Observation 31.2.8. In the limit, the sample average will converge to the true
expectation (utility) from Definition 31.2.4.
Remark 31.2.9. Direct utility estimation is just supervised learning, where each
example has the state as input and the observed reward to go as output.
Upshot: We have reduced reinforcement learning to an inductive learning problem.
But direct utility estimation learns nothing until the end of the trial.
Intuition: Direct utility estimation searches for U in a hypothesis space that is too
large ⇝ many functions that violate the Bellman equations.
Example 31.2.16 (Passive ADP learning curves for the 4x3 world).
Given the optimal policy from Example 31.2.2
Note the large changes occurring around the 78th trial – this is the first time that
the agent falls into the -1 terminal state at (4,2).
Observation 31.2.17. The ADP agent is limited only by its ability to learn the
transition model. (intractable for large state spaces)
Idea: Adapt the passive ADP algorithm to handle this new freedom.
learn a complete model with outcome probabilities for all actions, rather than
just the model for the fixed policy. (use PASSIVE-ADP-AGENT)
choose actions; the utilities to learn are defined by the optimal policy and obey the
Bellman equation:
U(s) = R(s) + γ · max_{a∈A(s)} (∑_{s′} U(s′) · P(s′|s, a))
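The ADP agent itself is not spelled out in this extract, but once a transition model P(s′|s, a) and rewards R have been learned, the Bellman equation can be solved, e.g. by iterating it to a fixed point (value iteration); a generic sketch (all names and the toy usage below are assumptions for illustration, not the course's implementation):

def value_iteration(states, actions, P, R, gamma=1.0, eps=1e-6):
    """P[(s, a)]: dict {s2: probability}, R[s]: reward; returns utilities U(s).
       Terminal states are those with no applicable actions."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions(s):                        # terminal state
                new_u = R[s]
            else:
                new_u = R[s] + gamma * max(
                    sum(p * U[s2] for s2, p in P[(s, a)].items())
                    for a in actions(s))
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < eps:
            return U

# toy usage: action "go" moves from s0 to the terminal state s1
states = ["s0", "s1"]
P = {("s0", "go"): {"s1": 1.0}}
R = {"s0": -0.04, "s1": 1.0}
print(value_iteration(states, lambda s: ["go"] if s == "s0" else [], P, R))
# {'s0': 0.96, 's1': 1.0}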
Example 31.3.1 (Greedy ADP learning curves for the 4x3 world).
The agent follows the optimal policy for the learned model at each step.
It does not learn the true utilities or the true optimal policy!
instead, in the 39th trial, it finds a policy that reaches the +1 reward along the
lower route via (2,1), (3,1), (3,2), and (3,3).
After experimenting with minor variations, from the 276th trial onward it sticks
to that policy, never learning the utilities of the other states and never finding
the optimal route via (1,2), (1,3), and (2,3).
What can be done? The agent does not know the true environment.
Idea: Actions do more than provide rewards according to the learned model;
they also contribute to learning the true model by affecting the percepts received.
By improving the model, the agent may reap greater rewards in the future.
Observation 31.3.3. An agent must make a tradeoff between
exploitation to maximize its reward as reflected in its current utility estimates
and
exploration to maximize its long term well-being.
Pure exploitation risks getting stuck in a rut. Pure exploration to improve one’s
knowledge is of no use if one never puts that knowledge into practice.
Compare with the information gathering agent from section 25.6.
Communication
In other words: the language you use all day long, e.g. English, German, . . .
Why should we care about natural language?:
Even more so than thinking, language is a skill that only humans have.
It is a miracle that we can express complex thoughts in a sentence in a matter
of seconds.
It is no less miraculous that a child can learn tens of thousands of words and a
complex grammar in a matter of a few years.
Language Technology
Language Assistance:
written language: Spell/grammar/style-checking,
spoken language: dictation systems and screen readers,
multilingual text: machine-supported text and dialog translation, eLearning.
Information management:
Psychology/Cognition: Semantics =
b “what is in our brains” (; mental models)
Mathematics has driven much of modern logic in the quest for foundations.
Logic as “foundation of mathematics” solved as far as possible
In daily practice syntax and semantics are not differentiated (much).
A good probe into the issues involved in natural language understanding is to look at trans-
lations between natural language utterances – a task that arguably involves understanding the
utterances first.
Example 32.2.2. Wirf der Kuh das Heu über den Zaun. ̸;Throw the cow the
hay over the fence. (differing grammar; Google Translate)
Example 32.2.3. Grammar is not the only problem:
Der Geist ist willig, aber das Fleisch ist schwach! (The spirit is willing, but the flesh is weak!)
Der Schnaps ist gut, aber der Braten ist verkocht! (The liquor is good, but the roast is overcooked!)
Observation 32.2.4. We have to understand the meaning for high-quality trans-
lation!
If it is indeed the meaning of natural language, we should look further into how the form of the
utterances and their meaning interact.
Newspaper ;
For questions/answers, it would be very useful to find out what words (sentences/-
texts) mean.
Interpretation of natural language utterances: three problems
Let us support the last claim with a couple of initial examples. We will come back to these phenomena
again and again over the course of the course and study them in detail.
But there are other phenomena that we need to take into account when computing the meaning of
NL utterances.
[Diagram: an utterance is mapped – using grammar and lexicon – to the meaning of the utterance, from which inference and world knowledge extract the relevant information]
We will look at another example that shows that the situation with pragmatic analysis is even
more complex than we thought. Understanding this is one of the prime objectives of the AI-2
lecture.
[Diagram: refined pipeline – an utterance is mapped via the grammar to its semantic potential, then to an utterance-specific meaning, from which inference extracts the relevant information]
Example 32.2.12 is also a very good example for the claim in Observation 32.2.4 that even for high-
quality (machine) translation we need semantics. We end this very high-level introduction with
a caveat.
Demo:
DBPedia http://dbpedia.org/snorql/
Query: Soccer players, who are born in a country with more than 10 million in-
habitants, who played as goalkeeper for a club that has a stadium with more than
30.000 seats and the club country is different from the birth country
Even if we could get a perfect grasp of the semantics (aka. meaning) of NL utterances, their structure,
and their context dependency – we will try this in this lecture, but of course fail, since the issues are
much too involved and complex for just one lecture – we still could not account for all the
human mind does with language. But there is hope: for limited and well-understood domains,
we can do amazing things. This is what this course tries to show, both in theory as well as in
practice.
Logical analysis vs. conceptual analysis: These examples – mostly borrowed from David-
son [Dav67] – help us to see the difference between “logical analysis” and “conceptual analysis”.
We observed that from This is a big diamond. we cannot conclude This is big. Now consider the
sentence Jane is a beautiful dancer. Similarly, it does not follow from this that Jane is beautiful,
but only that she dances beautifully. Now, what it is to be beautiful or to be a beautiful dancer
is a complicated matter. To say what these things are is a problem of conceptual analysis. The
job of semantics is to uncover the logical form of these sentences. Semantics should tell us that
the two sentences have the same logical forms; and ensure that these logical forms make the right
predictions about the entailments and truth conditions of the sentences, specifically, that they
don’t entail that the object is big or that Jane is beautiful. But our semantics should provide a
distinct logical form for sentences of the type: This is a fake diamond. From which it follows that
the thing is fake, but not that it is a diamond.
One way to think about the examples of ambiguity on the previous slide is that they illustrate a
certain kind of indeterminacy in sentence meaning. But really what is indeterminate here is what
sentence is represented by the physical realization (the written sentence or the phonetic string).
The symbol duck just happens to be associated with two different things, the noun and the verb.
Figuring out how to interpret the sentence is a matter of deciding which item to select. Similarly
for the syntactic ambiguity represented by PP attachment. Once you, as interpreter, have selected
one of the options, the interpretation is actually fixed. (This doesn’t mean, by the way, that as
an interpreter you necessarily do select a particular one of the options, just that you can.) A
brief digression: Notice that this discussion is in part a discussion about compositionality, and
gives us an idea of what a non-compositional account of meaning could look like. The Radical
Pragmatic View is a non-compositional view: it allows the information content of a sentence to
be fixed by something that has no linguistic reflex.
To help clarify what is meant by compositionality, let me just mention a couple of other ways
in which a semantic account could fail to be compositional.
• Suppose your syntactic theory tells you that S has the structure [a[bc]] but your semantics
computes the meaning of S by first combining the meanings of a and b and then combining the
result with the meaning of c. This is non-compositional.
Sentence 1. entails that George was late; sentence 2. doesn’t. We might try to account for
this by saying that in the environment of the verb believe, a clause doesn’t mean what it
usually means, but something else instead. Then the clause that George was late is assumed
to contribute different things to the informational content of different sentences. This is a
non-compositional account.
Example 32.3.4. Every man loves a woman. (Keira Knightley or his mother!)
Example 32.3.5. Every car has a radio. (only one reading!)
Example 32.3.6. Some student in every course sleeps in every class at least
some of the time. (how many readings?)
Example 32.3.7. The president of the US is having an affair with an intern.
(2002 or 2000?)
Example 32.3.8. Everyone is here. (who is everyone?)
Observation: If we look at the first sentence, then we see that it has two readings:
1. there is one particular woman whom every man loves, and
2. for each man there is one woman whom that man loves.
These correspond to distinct situations (or possible worlds) that make the sentence true.
Observation: For the second example we only get one reading: the analogue of 2. The reason
for this lies not in the logical structure of the sentence, but in concepts involved. We interpret
the meaning of the word has as the relation “has as physical part”, which in our world carries a
certain uniqueness condition: If a is a physical part of b, then it cannot be a physical part of c,
unless b is a physical part of c or vice versa. This makes the structurally possible analogue to 1.
impossible in our world and we discard it.
Observation: In the examples above, we have seen that (in the worst case) we can have one
reading for every ordering of the quantificational phrases in the sentence. So, in the third example,
where we have four of them, we would get 4! = 24 readings. It should be clear from introspection that
we (humans) do not entertain 24 readings when we understand and process this sentence. Our
models should account for such effects as well.
Context and Interpretation: It appears that the last two sentences have different informational
content on different occasions of use. Suppose I say Everyone is here. at the beginning of class.
Then I mean that everyone who is meant to be in the class is here. Suppose I say it later in the
day at a meeting; then I mean that everyone who is meant to be at the meeting is here. What
shall we say about this? Here are three different kinds of solution:
Radical Semantic View On every occasion of use, the sentence literally means that everyone
in the world is here, and so is strictly speaking false. An interpreter recognizes that the speaker
has said something false, and uses general principles to figure out what the speaker actually
meant.
Radical Pragmatic View What the semantics provides is in some sense incomplete. What the
sentence means is determined in part by the context of utterance and the speaker’s intentions.
The differences in meaning are entirely due to extra-linguistic facts which have no linguistic
reflex.
The Intermediate View The logical form of sentences with the quantifier every contains a slot
for information which is contributed by the context. So extra-linguistic information is required
to fix the meaning; but the contribution of this information is mediated by linguistic form.
John loves his wife. Peter does too. (whom does Peter love?)
John loves golf, and Mary too. (who does what?)
Remark 32.4.2. Natural languages like English, German, or Spanish are not.
Example 32.4.3. Let us look at concrete examples
Not to be invited is sad! (definitely English)
To not be invited is sad! (controversial)
Definition 32.4.5. A text corpus (or simply corpus; plural corpora) is a large and
structured collection of natural language texts.
Definition 32.4.6. In corpus linguistics, corpora are used to do statistical analysis
and hypothesis testing, checking occurrences or validating linguistic rules within a
specific natural language.
As for Markov processes, we write P(c1:N) for the probability of a character sequence
c1 . . . cN of length N.
Definition 32.4.7. We call a character sequence of length n an n-gram (unigram,
bigram, trigram for n = 1, 2, 3).
With the chain rule and then the Markov property, we obtain
P(c1:N) = ∏_{i=1}^{N} P(ci | c1:i−1) = ∏_{i=1}^{N} P(ci | ci−2:i−1)
Thus, a trigram model for a language with 100 characters, P(ci |ci−2:i−1 ) has
1.000.000 entries. It can be estimated from a corpus with 107 characters.
One approach: Build a trigram language model P(ci |ci−2:i−1 , ℓ) for each candi-
date language ℓ by counting trigrams in a ℓ-corpus.
Apply Bayes’ rule and the Markov property to get the most likely language:
ℓ∗ = argmax_ℓ (P(ℓ | c1:N))
   = argmax_ℓ (P(ℓ) · P(c1:N | ℓ))
   = argmax_ℓ (P(ℓ) · ∏_{i=1}^{N} P(ci | ci−2:i−1, ℓ))
The prior probability P(ℓ) can be estimated; it is not a critical factor, since the
trigram language models are extremely sensitive.
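A toy version of this scheme (tiny made-up “corpora”, add-one smoothing, an assumed alphabet size, and a uniform prior P(ℓ) – all of these are illustrative assumptions):

import math
from collections import Counter

def trigram_model(corpus):
    """Counts for P(c_i | c_{i-2:i-1}): trigram counts and bigram context counts."""
    tri = Counter(corpus[i:i+3] for i in range(len(corpus) - 2))
    bi  = Counter(corpus[i:i+2] for i in range(len(corpus) - 2))
    return tri, bi

def log_likelihood(text, model, alpha=1.0, alphabet=30):
    """log P(text | model), with add-one smoothing over an assumed alphabet size."""
    tri, bi = model
    s = 0.0
    for i in range(len(text) - 2):
        ctx, c3 = text[i:i+2], text[i:i+3]
        s += math.log((tri[c3] + alpha) / (bi[ctx] + alpha * alphabet))
    return s

corpora = {"de": "der die das und ist nicht mit den von zu im auch",
           "en": "the and is not with of to in that it was for on"}
models = {lang: trigram_model(c) for lang, c in corpora.items()}

query = "is that the one"
# maximum likelihood language (uniform prior P(l) assumed)
print(max(models, key=lambda lang: log_likelihood(query, models[lang])))   # 'en'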
recognize that Mr. Sopersteen is the name of a person and aciphex is the name of
a drug.
Remark 32.4.16. Character-level language models are good for this task because
they can associate the character sequence ex with a drug name and steen with a
person name, and thereby identify words that they have never seen before.
1. There are many more words than characters. (100 vs. 10^5 in English)
2. And what is a word anyways? (space/punctuation-delimited substrings?)
3. Data sparsity: we do not have enough data! For a language model for 10^5
words in English, there are 10^15 trigrams.
4. Most training corpora do not have all words.
This trick can be refined: if we have a word classifier, we can use a new token per class,
e.g. <EMAIL> or <NUM>.
Clearly there are differences; how can we measure them to evaluate the models?
Definition 32.4.20. The perplexity of a sequence c1:N is defined as
Perplexity(c1:N) := P(c1:N)^(−1/N)
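For intuition: a model that assigns uniform probability 1/k to each of k characters has perplexity k; a two-line check (the numbers are arbitrary):

k, N = 100, 50
P_seq = (1.0 / k) ** N           # probability of any length-N sequence under the model
print(P_seq ** (-1.0 / N))       # 100.0 -> perplexity equals the branching factor k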
Information Retrieval
Definition 32.5.1. Information retrieval (IR) deals with the representation, orga-
nization, storage, and maintenance of information objects so that it provides the
user with easy access to the relevant information and satisfies the user’s various
information needs.
Example 32.5.3. Google and Bing are web search engines, their query is a bag
of words and documents are web pages, PDFs, images, videos, shopping portals.
Definition 32.5.5.
A multiset of words in V = {t1, . . ., tn} is called a bag of words (BOW), and can be
represented as a word frequency vector in N^|V|: the vector of raw word frequencies.
Example 32.5.6.
If we have two documents d1 = Have a good day! and d2 = Have a great day!,
then we can use V = {Have, a, good, great, day} and can represent good as ⟨0, 0, 1, 0, 0⟩,
great as ⟨0, 0, 0, 1, 0⟩, and d1 as ⟨1, 1, 1, 0, 1⟩.
Words outside the vocabulary are ignored in the BOW approach. So the document
d3 = What a day, a good day is represented as ⟨0, 2, 1, 0, 2⟩.
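The vectors of Example 32.5.6 can be computed mechanically; a small sketch (the tokenization rule is a simplifying assumption):

def bow(document, vocabulary):
    """Raw word frequency vector of a document over a fixed vocabulary."""
    words = [w.strip("!?.,").lower() for w in document.split()]
    return [sum(1 for w in words if w == t.lower()) for t in vocabulary]

V = ["Have", "a", "good", "great", "day"]
print(bow("Have a good day!", V))          # [1, 1, 1, 0, 1]
print(bow("What a day, a good day", V))    # [0, 2, 1, 0, 2]  ('What' is ignored)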
[Figure: documents D1 = (t1,1, t1,2, t1,3) and D2 = (t2,1, t2,2, t2,3) as vectors in a three-dimensional term space]
Idea: Introduce a weighting factor for the word frequency vector that de-emphasizes
the dimension of the more (globally) frequent words.
We need to normalize the word frequency vectors first:
Definition 32.5.10. Given a document d and a vocabulary word t∈V , the normal-
ized term frequency (also usually called just term frequency) tf(t, d) is the raw term
frequency divided by |d|.
Definition 32.5.11.
Given a document collection D = {d1, . . ., dN} and a word t, the inverse document
frequency is given by idf(t, D) := log10(N / |{d∈D | t∈d}|).
TF-IDF Example
Let D:={d1 , d2 } be a document corpus over the vocabulary
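The example is truncated in this extract; the following sketch therefore uses a made-up two-document corpus and only illustrates how tf and idf from the two definitions above combine:

import math

def tf(t, d):                 # normalized term frequency (Definition 32.5.10)
    return d.count(t) / len(d)

def idf(t, D):                # inverse document frequency (Definition 32.5.11)
    return math.log10(len(D) / sum(1 for d in D if t in d))

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

# made-up corpus of two documents, given as token lists
d1 = ["have", "a", "good", "day"]
d2 = ["have", "a", "great", "day"]
D = [d1, d2]
print(tfidf("good", d1, D))   # 0.25 * log10(2/1) ~ 0.075
print(tfidf("day", d1, D))    # 0.25 * log10(2/2) = 0.0  (occurs in every document)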
Once an answer set has been determined, the results have to be sorted, so that they can be
presented to the user. As the user has a limited attention span – users will look at most at three
to eight results before refining a query – it is important to rank the results, so that the hits that
contain information relevant to the user's information need appear early. This is a very difficult problem,
as it involves guessing the intentions and information context of users, to which the search engine
has no access.
Problem: There are many hits, and we need to sort them (e.g. by importance).
Idea: A web site is important . . . if many others hyperlink to it.
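This link-based importance idea is commonly realized by the PageRank algorithm; a generic power-iteration sketch (damping factor, iteration count, and the toy link graph are illustrative assumptions, not the course's formulation):

def pagerank(links, damping=0.85, iterations=50):
    """links[p]: list of pages that p links to; returns an importance score per page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links[p] or pages          # a sink distributes its rank evenly
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank

# tiny illustrative web: C is linked to by A and B, so it ends up most important
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))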
Getting the ranking right is a determining factor for the success of a search engine. In fact, the early
success of Google was based on the PageRank algorithm discussed above (and the fact that they figured
out a revenue stream using text ads to monetize searches).
Word Embeddings
Definition 32.6.2. A vector is called one hot, iff all components are 0 except for
one 1. We call a word embedding one hot, iff all of its vectors are.
Example 32.6.3 (Vector Space Methods in Information Retrieval).
Word frequency vectors are induced by adding up one hot word embeddings.
Example 32.6.4. Given a document corpus D – the context – the tf-idf word
embedding is given by e : t ↦ ⟨tfidf(t, d1, D), . . ., tfidf(t, d#(D), D)⟩.
Intuition behind these two: Words that occur in similar documents are similar.
Example 32.6.5. For the text “watch movies rather than read books” with
context size 2 and target rather, we have
context: C := {watch, movies, than, read}
vocabulary: V := {watch, movies, rather, than, read, books}
The hidden layer neurons just copy the weighted sum of inputs to the next layer
(no threshold)
The output layer computes the softmax of the hidden nodes
Outline
Communication
Grammars and syntactic analysis
Problems (real Language Phenomena)
Communication
“Classical” view (pre-1953):
Language consists of sentences that are true/false. (cf. logic)
“Modern” view (post-1953):
Language is a form of action!
Speech Acts
Stages in communication (informing): even here, the situation is complex
33.2 Grammar
A Video Nugget covering this section can be found at https://fau.tv/clip/id/35581.
We fortify our intuition about these – admittedly very abstract – constructions by an example
and introduce some more vocabulary.
S → NP ; Vi
NP → Article; N
Article → the | a | an
N → dog | teacher | . . .
Vi → sleeps | smells | . . .
Definition 33.2.4. The non-lexicon grammar rules are called structural, and the
nonterminals in the heads are called phrasal categories.
Context-Free Parsing
Recall: The sentences accepted by a grammar are defined “top-down” as those
the start symbol can be rewritten into.
Definition 33.2.5. Bottom up parsing works by replacing any substring that
matches the right hand side of a grammar rule by the head of that rule.
[Parse tree for “I shoot the Wumpus”:]
[S[N P [P ronoun I]][V P [T ransV erb shoot][N P [Article the][N oun Wumpus]]]]
Grammaticality Judgements
Problem: The formal language LG accepted by a grammar G may differ from the
natural language Ln it supposedly models.
Definition 33.2.8. We say that a grammar G over generates, iff it accepts strings
outside of Ln (false positives) and under generates, iff there are Ln strings (false
negatives) that LG does not accept.
Wumpus grammar
S → NP ; VP [.9] I + feel a breeze
| S ; Conjunction; S [.1] I feel a breeze + and + I smell a wumpus
NP → Pronoun [.3] I
| Name [.1] John
| Noun [.1] pits
| Article; Noun [.25] the + wumpus
| Article; Adjs; Noun [.05] the + smelly dead + wumpus
| Digit; Digit [.05] 34
| NP ; PP [.1] the wumpus + in 1 3
| NP ; RelClause [.05] the wumpus + that is smelly
VP → Verb [.25] stinks
| TransVerb; NP [.25] see + the Wumpus
| VP ; NP [.25] feel + a breeze
| VP ; Adjective [.05] is + smelly
| VP ; PP [.1] turn + to the east
| VP ; Adverb [.1] go + ahead
Adjs → Adjective [.8] smelly
| Adjective; Adjs [.2] smelly + dead
PP → Prep; NP [1] to + the east
RelClause → that; VP [1] that + is smelly
PCFG Parsing
Example 33.2.12. Reconsidering Example 33.2.6 with the Wumpus grammar
above, we get the PCFG parse tree:
[S .9 [NP .3 [Pronoun .1 I]] [VP .25 [TransVerb .1 shoot] [NP .25 [Article .4 the] [Noun .15 Wumpus]]]]
Note: two S-rooted subtrees, one with NP−SBJ−2 child and one with NP SBJ.
Ambiguity
Squad helps dog bite victim.
Helicopter powered by human flies.
American pushes bottle up Germans.
I ate spaghetti with meatballs.
salad.
abandon.
a fork.
a friend.
Anaphora
Using pronouns to refer back to entities already introduced in the text
After Mary proposed to John, they found a preacher and got married.
For the honeymoon, they went to Hawaii.
Mary saw a ring through the window and asked John for it.
Mary threw a rock at the window and broke it.
Indexicality
Metonymy
Using one noun phrase to stand for another
Metaphor
“Non-literal” usage of words and phrases, often systematic:
I’ve tried killing the process but it won’t die. Its parent keeps it alive.
Noncompositionality
basketball shoes
baby shoes
alligator shoes
designer shoes
brake shoes
Noncompositionality
small moon
mere child
alleged murderer
real leather
artificial grass
Planning
Planning Frameworks
Planning Algorithms
Planning and Acting in the real world
[Figure: an agent interacting with its environment through sensors (percepts) and actuators (actions) – AIMA Figure 2.1]
Simple Reflex Agents
[Figure: schematic diagram of a simple reflex agent – condition action rules map the percept-derived description of the current world state to an action (AIMA Figure 2.9)]
function Simple−Reflex−Agent(percept) returns an action
  persistent: rules, a set of condition–action rules
  state ← Interpret−Input(percept)
  rule ← Rule−Match(state,rules)
  action ← rule.Action
  return action
Reflex Agents with State
[Figure: a model based reflex agent – it keeps track of the current state of the world using an internal model and then chooses an action in the same way as the reflex agent (AIMA Figure 2.12)]
state ← Update−State(state,action,percept,model)
rule ← Rule−Match(state,rules)
action ← rule.Action
return action
[Figure: a general learning agent – a performance element selecting actions, a learning element improving it using feedback from a critic (relative to a performance standard), and a problem generator suggesting exploratory actions (AIMA Figure 2.15)]
Rational Agent
Idea: Try to design agents that are successful. (do the right thing)
Definition 34.0.1. An agent is called rational, if it chooses whichever action max-
imizes the expected value of the performance measure given the percept sequence
to date. This is called the MEU principle.
Note: A rational agent need not be perfect:
  it only needs to maximize expected value (rational ̸= omniscient)
    it need not predict e.g. very unlikely but catastrophic events in the future
  percepts may not supply all relevant information (rational ̸= clairvoyant)
    if we cannot perceive things we do not need to react to them.
    but we may need to try to find out about hidden dangers (exploration)
  action outcomes may not be as expected (rational ̸= successful)
    but we may need to take action to ensure that they do (more often) (learning)
Rational ; exploration, learning, autonomy
System 1 can
  see whether an object is near or far
  complete the phrase “war and . . . ”
  display disgust when seeing a gruesome image
  solve 2 + 2 = ?
  read text on a billboard
System 2 can
  look out for the woman with the grey hair
  sustain a higher than normal walking rate
  count the number of A’s in a certain text
  give someone your phone number
  park into a tight parking space
  solve 17 × 24
  determine the validity of a complex argument
[Diagram: System 1 and System 2 interact (“critically guides”, “massively extends reach”, “unintentionally influences”); System 1 corresponds to subsymbolic AI, System 2 to symbolic AI]
Uncertainty
Probabilistic Reasoning
Making Decisions in Episodic Environments
Problem Solving in Sequential Environments
[Aus62] John Langshaw Austin. How to do things with words. William James Lectures. Oxford
University Press, 1962.
[Bac00] Fahiem Bacchus. Subset of PDDL for the AIPS2000 Planning Competition. The AIPS-
00 Planning Competition Comitee. 2000.
[BF95] Avrim L. Blum and Merrick L. Furst. “Fast planning through planning graph analysis”.
In: Proceedings of the 14th International Joint Conference on Artificial Intelligence
(IJCAI). Ed. by Chris S. Mellish. Montreal, Canada: Morgan Kaufmann, San Mateo,
CA, 1995, pp. 1636–1642.
[BF97] Avrim L. Blum and Merrick L. Furst. “Fast planning through planning graph analysis”.
In: Artificial Intelligence 90.1-2 (1997), pp. 279–298.
[BG01] Blai Bonet and Héctor Geffner. “Planning as Heuristic Search”. In: Artificial Intelli-
gence 129.1–2 (2001), pp. 5–33.
[BG99] Blai Bonet and Héctor Geffner. “Planning as Heuristic Search: New Results”. In:
Proceedings of the 5th European Conference on Planning (ECP’99). Ed. by S. Biundo
and M. Fox. Springer-Verlag, 1999, pp. 60–72.
[BKS04] Paul Beame, Henry A. Kautz, and Ashish Sabharwal. “Towards Understanding and
Harnessing the Potential of Clause Learning”. In: Journal of Artificial Intelligence
Research 22 (2004), pp. 319–351.
[Bon+12] Blai Bonet et al., eds. Proceedings of the 22nd International Conference on Automated
Planning and Scheduling (ICAPS’12). AAAI Press, 2012.
[Bro90] Rodney Brooks. “Elephants Don’t Play Chess”. In: Robotics and Autonomous Systems 6.1–2 (1990), pp. 3–15. doi:
10.1016/S0921-8890(05)80025-9.
[Byl94] Tom Bylander. “The Computational Complexity of Propositional STRIPS Planning”.
In: Artificial Intelligence 69.1–2 (1994), pp. 165–204.
[Cho65] Noam Chomsky. Syntactic structures. Den Haag: Mouton, 1965.
[CKT91] Peter Cheeseman, Bob Kanefsky, and William M. Taylor. “Where the Really Hard
Problems Are”. In: Proceedings of the 12th International Joint Conference on Artificial
Intelligence (IJCAI). Ed. by John Mylopoulos and Ray Reiter. Sydney, Australia:
Morgan Kaufmann, San Mateo, CA, 1991, pp. 331–337.
[CM85] Eugene Charniak and Drew McDermott. Introduction to Artificial Intelligence. Ad-
dison Wesley, 1985.
[CQ69] Allan M. Collins and M. Ross Quillian. “Retrieval time from semantic memory”. In:
Journal of verbal learning and verbal behavior 8.2 (1969), pp. 240–247. doi: 10.1016/
S0022-5371(69)80069-1.
[Dav67] Donald Davidson. “Truth and Meaning”. In: Synthese 17 (1967).
[DCM12] DCMI Usage Board. DCMI Metadata Terms. DCMI Recommendation. Dublin Core
Metadata Initiative, June 14, 2012. url: http : / / dublincore . org / documents /
2012/06/14/dcmi-terms/.
[DF31] B. De Finetti. “Sul significato soggettivo della probabilita”. In: Fundamenta Mathe-
maticae 17 (1931), pp. 298–329.
[DHK15] Carmel Domshlak, Jörg Hoffmann, and Michael Katz. “Red-Black Planning: A New
Systematic Approach to Partial Delete Relaxation”. In: Artificial Intelligence 221
(2015), pp. 73–114.
[Ede01] Stefan Edelkamp. “Planning with Pattern Databases”. In: Proceedings of the 6th Eu-
ropean Conference on Planning (ECP’01). Ed. by A. Cesta and D. Borrajo. Springer-
Verlag, 2001, pp. 13–24.
[FD14] Zohar Feldman and Carmel Domshlak. “Simple Regret Optimization in Online Plan-
ning for Markov Decision Processes”. In: Journal of Artificial Intelligence Research
51 (2014), pp. 165–205.
[Fis] John R. Fisher. prolog :- tutorial. url: https://www.cpp.edu/~jrfisher/www/
prolog_tutorial/ (visited on 10/10/2019).
[FL03] Maria Fox and Derek Long. “PDDL2.1: An Extension to PDDL for Expressing Tem-
poral Planning Domains”. In: Journal of Artificial Intelligence Research 20 (2003),
pp. 61–124.
[Fla94] Peter Flach. Simply Logical – Intelligent Reasoning by Example. Wiley, 1994. isbn: 0471 94152 2. url: https://github.com/simply-
logical/simply-logical/releases/download/v1.0/SL.pdf.
[FN71] Richard E. Fikes and Nils Nilsson. “STRIPS: A New Approach to the Application of
Theorem Proving to Problem Solving”. In: Artificial Intelligence 2 (1971), pp. 189–
208.
[Gen34] Gerhard Gentzen. “Untersuchungen über das logische Schließen I”. In: Mathematische
Zeitschrift 39.2 (1934), pp. 176–210.
[Ger+09] Alfonso Gerevini et al. “Deterministic planning in the fifth international planning
competition: PDDL3 and experimental evaluation of the planners”. In: Artificial In-
telligence 173.5-6 (2009), pp. 619–668.
[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability—A Guide to
the Theory of NP-Completeness. Freeman, 1979.
[Glo] Grundlagen der Logik in der Informatik. Course notes at https://www8.cs.fau.de/
_media/ws16:gloin:skript.pdf. url: https://www8.cs.fau.de/_media/ws16:
gloin:skript.pdf (visited on 10/13/2017).
[GNT04] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: Theory and
Practice. Morgan Kaufmann, 2004.
[GS05] Carla Gomes and Bart Selman. “Can get satisfaction”. In: Nature 435 (2005), pp. 751–
752.
[GSS03] Alfonso Gerevini, Alessandro Saetti, and Ivan Serina. “Planning through Stochas-
tic Local Search and Temporal Action Graphs”. In: Journal of Artificial Intelligence
Research 20 (2003), pp. 239–290.
[Hau85] John Haugeland. Artificial intelligence: the very idea. Massachusetts Institute of Tech-
nology, 1985.
[HD09] Malte Helmert and Carmel Domshlak. “Landmarks, Critical Paths and Abstractions:
What’s the Difference Anyway?” In: Proceedings of the 19th International Conference
on Automated Planning and Scheduling (ICAPS’09). Ed. by Alfonso Gerevini et al.
AAAI Press, 2009, pp. 162–169.
[HE05] Jörg Hoffmann and Stefan Edelkamp. “The Deterministic Part of IPC-4: An Overview”.
In: Journal of Artificial Intelligence Research 24 (2005), pp. 519–579.
[Hel06] Malte Helmert. “The Fast Downward Planning System”. In: Journal of Artificial In-
telligence Research 26 (2006), pp. 191–246.
[Her+13a] Ivan Herman et al. RDF 1.1 Primer (Second Edition). Rich Structured Data Markup
for Web Documents. W3C Working Group Note. World Wide Web Consortium (W3C),
2013. url: http://www.w3.org/TR/rdfa-primer.
[Her+13b] Ivan Herman et al. RDFa 1.1 Primer – Second Edition. Rich Structured Data Markup
for Web Documents. W3C Working Goup Note. World Wide Web Consortium (W3C),
Apr. 19, 2013. url: http://www.w3.org/TR/xhtml-rdfa-primer/.
[HG00] Patrik Haslum and Hector Geffner. “Admissible Heuristics for Optimal Planning”. In:
Proceedings of the 5th International Conference on Artificial Intelligence Planning
Systems (AIPS’00). Ed. by S. Chien, R. Kambhampati, and C. Knoblock. Brecken-
ridge, CO: AAAI Press, Menlo Park, 2000, pp. 140–149.
[HG08] Malte Helmert and Hector Geffner. “Unifying the Causal Graph and Additive Heuris-
tics”. In: Proceedings of the 18th International Conference on Automated Planning
and Scheduling (ICAPS’08). Ed. by Jussi Rintanen et al. AAAI Press, 2008, pp. 140–
147.
[HHH07] Malte Helmert, Patrik Haslum, and Jörg Hoffmann. “Flexible Abstraction Heuristics
for Optimal Sequential Planning”. In: Proceedings of the 17th International Conference
on Automated Planning and Scheduling (ICAPS’07). Ed. by Mark Boddy, Maria
Fox, and Sylvie Thiebaux. Providence, Rhode Island, USA: Morgan Kaufmann, 2007,
pp. 176–183.
[Hit+12] Pascal Hitzler et al. OWL 2 Web Ontology Language Primer (Second Edition). W3C
Recommendation. World Wide Web Consortium (W3C), 2012. url: http://www.
w3.org/TR/owl-primer.
[HN01] Jörg Hoffmann and Bernhard Nebel. “The FF Planning System: Fast Plan Generation
Through Heuristic Search”. In: Journal of Artificial Intelligence Research 14 (2001),
pp. 253–302.
[Hof05] Jörg Hoffmann. “Where ‘Ignoring Delete Lists’ Works: Local Search Topology in Plan-
ning Benchmarks”. In: Journal of Artificial Intelligence Research 24 (2005), pp. 685–
758.
[Hof11] Jörg Hoffmann. “Everything You Always Wanted to Know about Planning (But
Were Afraid to Ask)”. In: Proceedings of the 34th Annual German Conference on
Artificial Intelligence (KI’11). Ed. by Joscha Bach and Stefan Edelkamp. Vol. 7006.
Lecture Notes in Computer Science. Springer, 2011, pp. 1–13. url: http://fai.cs.
uni-saarland.de/hoffmann/papers/ki11.pdf.
[How60] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
[ILD] 7. Constraints: Interpreting Line Drawings. url: https://www.youtube.com/watch?
v=l-tzjenXrvI&t=2037s (visited on 11/19/2019).
[JN33] J. Neyman and E. S. Pearson. “IX. On the problem of the most efficient tests of statis-
tical hypotheses”. In: Philosophical Transactions of the Royal Society of London A:
Mathematical, Physical and Engineering Sciences 231.694-706 (1933), pp. 289–337.
doi: 10.1098/rsta.1933.0009.
[Kah11] Daniel Kahneman. Thinking, fast and slow. Penguin Books, 2011. isbn: 9780141033570.
[KC04] Graham Klyne and Jeremy J. Carroll. Resource Description Framework (RDF): Con-
cepts and Abstract Syntax. W3C Recommendation. World Wide Web Consortium
(W3C), Feb. 10, 2004. url: http://www.w3.org/TR/2004/REC- rdf- concepts-
20040210/.
[KD09] Erez Karpas and Carmel Domshlak. “Cost-Optimal Planning with Landmarks”. In:
Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJ-
CAI’09). Ed. by C. Boutilier. Pasadena, California, USA: Morgan Kaufmann, July
2009, pp. 1728–1733.
[Met+53] N. Metropolis et al. “Equations of state calculations by fast computing machines”. In:
Journal of Chemical Physics 21 (1953), pp. 1087–1091.
[Min] Minion - Constraint Modelling. System Web page at http://constraintmodelling.
org/minion/. url: http://constraintmodelling.org/minion/.
[MMS93] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. “Building a
large annotated corpus of English: the penn treebank”. In: Computational Linguistics
19.2 (1993), pp. 313–330.
[MR91] John Mylopoulos and Ray Reiter, eds. Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI’91). Sydney, Australia: Morgan Kaufmann, San Mateo, CA, 1991.
[MSL92] David Mitchell, Bart Selman, and Hector J. Levesque. “Hard and Easy Distributions
of SAT Problems”. In: Proceedings of the 10th National Conference of the American
Association for Artificial Intelligence (AAAI’92). San Jose, CA: MIT Press, 1992,
pp. 459–465.
[NHH11] Raz Nissim, Jörg Hoffmann, and Malte Helmert. “Computing Perfect Heuristics in
Polynomial Time: On Bisimulation and Merge-and-Shrink Abstraction in Optimal
Planning”. In: Proceedings of the 22nd International Joint Conference on Artificial
Intelligence (IJCAI’11). Ed. by Toby Walsh. AAAI Press/IJCAI, 2011, pp. 1983–
1990.
[Nor+18a] Emily Nordmann et al. Lecture capture: Practical recommendations for students and
lecturers. 2018. url: https://osf.io/huydx/download.
[Nor+18b] Emily Nordmann et al. Vorlesungsaufzeichnungen nutzen: Eine Anleitung für Studierende.
2018. url: https://osf.io/e6r7a/download.
[NS63] Allen Newell and Herbert Simon. “GPS, a program that simulates human thought”.
In: Computers and Thought. Ed. by E. Feigenbaum and J. Feldman. McGraw-Hill,
1963, pp. 279–293.
[NS76] Alan Newell and Herbert A. Simon. “Computer Science as Empirical Inquiry: Symbols
and Search”. In: Communications of the ACM 19.3 (1976), pp. 113–126. doi: 10.
1145/360018.360022.
[OWL09] OWL Working Group. OWL 2 Web Ontology Language: Document Overview. W3C
Recommendation. World Wide Web Consortium (W3C), Oct. 27, 2009. url: http:
//www.w3.org/TR/2009/REC-owl2-overview-20091027/.
[PD09] Knot Pipatsrisawat and Adnan Darwiche. “On the Power of Clause-Learning SAT
Solvers with Restarts”. In: Proceedings of the 15th International Conference on Princi-
ples and Practice of Constraint Programming (CP’09). Ed. by Ian P. Gent. Vol. 5732.
Lecture Notes in Computer Science. Springer, 2009, pp. 654–668.
[Pól73] George Pólya. How to Solve it. A New Aspect of Mathematical Method. Princeton
University Press, 1973.
[Pra+94] Malcolm Pradhan et al. “Knowledge Engineering for Large Belief Networks”. In:
Proceedings of the Tenth International Conference on Uncertainty in Artificial In-
telligence. UAI’94. Seattle, WA: Morgan Kaufmann Publishers Inc., 1994, pp. 484–
490. isbn: 1-55860-332-8. url: http://dl.acm.org/citation.cfm?id=2074394.
2074456.
[Pro] Protégé. Project Home page at http://protege.stanford.edu. url: http://protege.stanford.edu.
[PRR97] G. Probst, St. Raub, and Kai Romhardt. Wissen managen. Gabler Verlag, 1997 (4th ed. 2003).
[PS08] Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. W3C
Recommendation. World Wide Web Consortium (W3C), Jan. 15, 2008. url: http:
//www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/.
[PW92] J. Scott Penberthy and Daniel S. Weld. “UCPOP: A Sound, Complete, Partial Order
Planner for ADL”. In: Principles of Knowledge Representation and Reasoning: Pro-
ceedings of the 3rd International Conference (KR-92). Ed. by B. Nebel, W. Swartout,
and C. Rich. Cambridge, MA: Morgan Kaufmann, Oct. 1992, pp. 103–114. url: ftp:
//ftp.cs.washington.edu/pub/ai/ucpop-kr92.ps.Z.
[RHN06] Jussi Rintanen, Keijo Heljanko, and Ilkka Niemelä. “Planning as satisfiability: parallel
plans and algorithms for plan search”. In: Artificial Intelligence 170.12-13 (2006),
pp. 1031–1080.
[Rin10] Jussi Rintanen. “Heuristics for Planning with SAT”. In: Proceedings of the 16th In-
ternational Conference on Principles and Practice of Constraint Programming. 2010,
pp. 414–428.
[RN03] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2nd ed.
Pearson Education, 2003. isbn: 0137903952.
[RN09] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd.
Prentice Hall Press, 2009. isbn: 0136042597, 9780136042594.
[RN95] Stuart J. Russell and Peter Norvig. Artificial Intelligence — A Modern Approach.
Upper Saddle River, NJ: Prentice Hall, 1995.
[RW10] Silvia Richter and Matthias Westphal. “The LAMA Planner: Guiding Cost-Based
Anytime Planning with Landmarks”. In: Journal of Artificial Intelligence Research
39 (2010), pp. 127–177.
[RW91] S. J. Russell and E. Wefald. Do the Right Thing — Studies in limited Rationality.
MIT Press, 1991.
[Sea69] John R. Searle. Speech Acts: An Essay in the Philosophy of Language. Cambridge,
London: Cambridge University Press, 1969.
[Sil+16] David Silver et al. “Mastering the Game of Go with Deep Neural Networks and Tree
Search”. In: Nature 529 (2016), pp. 484–503. url: http://www.nature.com/nature/
journal/v529/n7587/full/nature16961.html.
[Smu63] Raymond M. Smullyan. “A Unifying Principle for Quantification Theory”. In: Proc.
Nat. Acad Sciences 49 (1963), pp. 828–832.
[SR14] Guus Schreiber and Yves Raimond. RDF 1.1 Primer. W3C Working Group Note.
World Wide Web Consortium (W3C), 2014. url: http://www.w3.org/TR/rdf-
primer.
[SR91] C. Samuelsson and M. Rayner. “Quantitative evaluation of explanation-based learning
as an optimization tool for a large-scale natural language system”. In: Proceedings of
the 12th International Joint Conference on Artificial Intelligence (IJCAI). Ed. by
John Mylopoulos and Ray Reiter. Sydney, Australia: Morgan Kaufmann, San Mateo,
CA, 1991, pp. 609–615.
[sTeX] sTeX: A semantic Extension of TeX/LaTeX. url: https://github.com/sLaTeX/
sTeX (visited on 05/11/2020).
[SWI] SWI Prolog Reference Manual. url: https://www.swi-prolog.org/pldoc/refman/
(visited on 10/10/2019).
[Tur50] Alan Turing. “Computing Machinery and Intelligence”. In: Mind 59 (1950), pp. 433–
460.
[Wal75] David Waltz. “Understanding Line Drawings of Scenes with Shadows”. In: The Psy-
chology of Computer Vision. Ed. by P. H. Winston. McGraw-Hill, 1975, pp. 1–19.
[WHI] Human intelligence — Wikipedia The Free Encyclopedia. url: https://en.wikipedia.
org/w/index.php?title=Human_intelligence (visited on 04/09/2018).
[Wit53] Ludwig Wittgenstein. Philosophical Investigations. Basil Blackwell, 1953. isbn: 0631119000.
Part VIII
Excursions
As this course is predominantly an overview of the topics of Artificial Intelligence, and not
about their theoretical underpinnings, we collect the discussion of the latter in this “suggested
readings” part.
Appendix A
Completeness of Calculi for Propositional Logic
The next step is to analyze the two calculi for completeness. For that we will first give ourselves
a very powerful tool: the “model existence theorem” (??), which encapsulates the model-theoretic
part of completeness theorems. With that, completeness proofs – which are quite tedious otherwise
– become a breeze.
Corollary: With the model existence theorem in hand, the completeness of C follows.
The proof of the model existence theorem goes via the notion of a Hintikka set, a set of
formulae with very strong syntactic closure properties, which allow us to read off models. Jaakko
Hintikka’s original idea for completeness proofs was that for every complete calculus C and every
C-consistent set one can induce a Hintikka set, from which a model can be constructed. This can
be considered as a first model existence theorem. However, the process of obtaining a Hintikka set
for a C-consistent set Φ of sentences usually involves complicated calculus dependent constructions.
In this situation, Raymond Smullyan was able to formulate the sufficient conditions for the
existence of Hintikka sets in the form of “abstract consistency properties” by isolating the calculus
independent parts of the Hintikka set construction. His technique allows us to reformulate Hintikka
sets as maximal elements of abstract consistency classes and interpret the Hintikka set construction
as a maximizing limit process.
To carry out the “model-existence”/“abstract consistency” method, we will first have to look at
the notion of consistency.
Consistency and refutability are very important notions when studying the completeness for
calculi; they form syntactic counterparts of satisfiability.
Consistency
Let C be a calculus,. . .
Definition A.1.1. Let C be a calculus, then a formula set Φ is called C-refutable, if we can
derive a contradiction from it.
Definition A.1.2. We call a pair of formulae A and ¬A a contradiction.
So a set Φ is C-refutable, if C can derive a contradiction from it.
Definition A.1.3. Let C be a calculus, then a formula set Φ is called C-consistent, iff there
is a formula B that is not derivable from Φ in C.
Definition A.1.4. We call a calculus C reasonable, iff implication elimination and
conjunction introduction are admissible in C and A ∧ ¬A ⇒ B is a C-theorem.
Theorem A.1.5. C-inconsistency and C-refutability coincide for reasonable calculi.
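For instance, the direction from C-refutability to C-inconsistency can be spelled out directly from
the definition of a reasonable calculus: if Φ⊢C A and Φ⊢C ¬A for some A, then conjunction
introduction gives Φ⊢C A ∧ ¬A, and since A ∧ ¬A ⇒ B is a C-theorem, implication elimination
yields Φ⊢C B for every formula B, i.e. Φ is C-inconsistent. The converse direction is immediate:
if every formula is derivable from Φ, then so are A and ¬A for any A.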
It is very important to distinguish the syntactic C-refutability and C-consistency from satisfiability,
which is a property of formulae that is at the heart of semantics. Note that the former have the
calculus (a syntactic device) as a parameter, while the latter does not. In fact we should actually
say S-satisfiability, where S := ⟨L, K, |=⟩ is the current logical system.
Even the word “contradiction” has a syntactic flavor to it; it translates to “saying against
each other” from its Latin root.
Abstract Consistency
Definition A.1.6. Let ∇ be a family of sets. We call ∇ closed under subsets, iff
for each Φ∈∇, all subsets Ψ ⊆ Φ are elements of ∇.
Notation: We will use Φ∗A for Φ ∪ {A}.
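Spelled out (here only schematically, since the precise formulation depends on the chosen set of
connectives), the conditions require that ∇ is closed under subsets and that for all Φ∈∇:
∇c ) P ∉Φ or ¬P ∉Φ for atomic P ,
∇¬ ) if ¬¬A∈Φ, then Φ∗A∈∇,
together with two closure conditions for the binary connectives, e.g. if A ∨ B∈Φ, then Φ∗A∈∇ or
Φ∗B∈∇, and if ¬(A ∨ B)∈Φ, then Φ ∪ {¬A, ¬B}∈∇.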
So a family of sets (we call it a family, so that we do not have to say “set of sets” and we can
distinguish the levels) is an abstract consistency class, iff it fulfills five simple conditions, of which
the last three are closure conditions.
Think of an abstract consistency class as a family of “consistent” sets (e.g. C-consistent for some
calculus C); then the properties make perfect sense. Such sets are naturally closed under subsets: if
we cannot derive a contradiction from a large set, we certainly cannot from a subset. Furthermore,
∇c ) if both P ∈Φ and ¬P ∈Φ, then Φ cannot be “consistent”, and
∇¬ ) if we cannot derive a contradiction from Φ with ¬¬A∈Φ, then we cannot from Φ∗A either,
since the two are logically equivalent.
The other two conditions are motivated similarly. We will carry out the proof here, since it
gives us practice in dealing with the abstract consistency properties.
The main result here is that abstract consistency classes can be extended to compact ones. The
proof is quite tedious, but relatively straightforward. It allows us to assume that all abstract
consistency classes are compact in the first place (otherwise we pass to the compact extension).
Actually we are after abstract consistency classes that have an even stronger property than just
being closed under subsets. This will allow us to carry out a limit construction in the Hintikka
set extension argument later.
Compact Collections
Definition A.1.11. We call a collection ∇ of sets compact, iff for any set Φ we
have
Φ∈∇, iff Ψ∈∇ for every finite subset Ψ of Φ.
Lemma A.1.12. If ∇ is compact, then ∇ is closed under subsets.
Proof:
1. Suppose S ⊆ T and T ∈∇.
2. Every finite subset A of S is a finite subset of T .
3. As ∇ is compact, we know that A∈∇.
700 APPENDIX A. COMPLETENESS OF CALCULI FOR PROPOSITIONAL LOGIC
4. Thus S∈∇.
The property of being closed under subsets is a “downwards-oriented” property: we go from large
sets to small sets. Compactness (in the interesting direction) is an “upwards-oriented”
property: we can go from small (finite) sets to large (infinite) sets. The main application for the
compactness condition will be to show that infinite sets of formulae are in a family ∇ by testing
all their finite subsets (which is much simpler).
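As a concrete example, consider the family of all satisfiable sets of propositional formulae: by the
compactness theorem of propositional logic a set Φ is satisfiable iff every finite subset of Φ is, so
this family is compact in the sense of the definition above.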
We now show the result announced above: any abstract consistency class ∇ can be extended to a
compact abstract consistency class ∇′ with ∇ ⊆ ∇′ .
Proof:
1. We choose ∇′ :={Φ ⊆ wff0 (V0 )|every finite subset of Φ is in ∇}.
2. Now suppose that Φ∈∇. ∇ is closed under subsets, so every finite subset of
Φ is in ∇ and thus Φ∈∇′ . Hence ∇ ⊆ ∇′ .
3. Next let us show that ∇′ is compact.
3.1. Suppose Φ∈∇′ and Ψ is an arbitrary finite subset of Φ.
3.2. By definition of ∇′ all finite subsets of Φ are in ∇; since every finite subset of Ψ is also a finite subset of Φ, we get Ψ∈∇′ .
3.3. Thus all finite subsets of Φ are in ∇′ whenever Φ is in ∇′ .
3.4. On the other hand, suppose all finite subsets of Φ are in ∇′ .
3.5. Then by the definition of ∇′ the finite subsets of Φ are also in ∇, so Φ∈∇′ .
Thus ∇′ is compact.
4. Note that ∇′ is closed under subsets by the Lemma above.
5. Now we show that if ∇ satisfies the abstract consistency conditions ∇∗ , then ∇′ satisfies them as well.
5.1. To show ∇c , let Φ∈∇′ and suppose there is an atom A, such that {A, ¬A} ⊆
Φ. Then {A, ¬A}∈∇ contradicting ∇c .
5.2. To show ∇¬ , let Φ∈∇′ and ¬¬A∈Φ, then Φ∗A∈∇′ .
5.2.1. Let Ψ be any finite subset of Φ∗A, and Θ:=(Ψ\{A})∗¬¬A.
5.2.2. Θ is a finite subset of Φ, so Θ∈∇.
5.2.3. Since ∇ is an abstract consistency class and ¬¬A∈Θ, we get Θ∗A∈∇
by ∇¬ .
5.2.4. We know that Ψ ⊆ Θ∗A and ∇ is closed under subsets, so Ψ∈∇.
5.2.5. Thus every finite subset Ψ of Φ∗A is in ∇ and therefore by definition
Φ∗A∈∇′ .
5.3. The other cases are analogous to ∇¬ .
Hintikka sets are sets of sentences with very strong analytic closure conditions. These are motivated
as maximally consistent sets, i.e. sets that already contain everything that can be consistently
added to them.
∇-Hintikka Set
Definition A.1.14. Let ∇ be an abstract consistency class, then we call a set
H∈∇ a ∇-Hintikka set, iff H is maximal in ∇, i.e. for all A with H∗A∈∇ we
already have A∈H.
∇-Hintikka Set
Theorem A.1.15 (Hintikka Properties). Let ∇ be an abstract consistency class and H a
∇-Hintikka set, then
Hc ) for no formula A we have both A∈H and ¬A∈H, and
H¬ ) if ¬¬A∈H, then A∈H
(together with analogous closure properties for the other connectives).
Proof:
We prove the properties in turn
1. Hc by induction on the structure of A
1.1. A∈V0 : Then A∉H or ¬A∉H by ∇c .
1.2. A = ¬B
1.2.1. Let us assume that ¬B∈H and ¬¬B∈H,
1.2.2. then H∗B∈∇ by ∇¬ , and therefore B∈H by maximality.
1.2.3. So both B and ¬B are in H, which contradicts the inductive hy-
pothesis.
1.3. A = B ∨ C: similar to the previous case
2. We prove H¬ by maximality of H in ∇.
2.1. If ¬¬A∈H, then H∗A∈∇ by ∇¬ .
2.2. The maximality of H now gives us that A∈H.
Proof sketch: The other Hintikka properties are shown analogously.
The following theorem is one of the main results in the “abstract consistency”/“model existence”
method. For any set Φ in an abstract consistency class ∇ it allows us to construct a ∇-Hintikka set H with Φ ⊆ H.
Extension Theorem
Theorem A.1.16. If ∇ is an abstract consistency class and Φ∈∇, then there is a
∇-Hintikka set H with Φ ⊆ H.
Proof:
1. Wlog. we assume that ∇ is compact (otherwise pass to compact extension)
2. We choose an enumeration A1 , . . . of the set wff0 (V0 )
3. and construct a sequence of sets Hi with H0 :=Φ and
Hn+1 := Hn , if Hn ∗An ∉ ∇, and Hn+1 := Hn ∗An , if Hn ∗An ∈ ∇.
4. Note that all Hi ∈∇; choose H := ⋃i∈N Hi .
5. Ψ ⊆ H finite implies there is a j∈N such that Ψ ⊆ Hj ,
6. so Ψ∈∇ as ∇ closed under subsets and H∈∇ as ∇ is compact.
7. Let H∗B∈∇, then there is a j∈N with B = Aj . Since Hj ∗Aj ⊆ H∗B and ∇ is closed
under subsets, we have Hj ∗Aj ∈∇, so B∈Hj+1 ⊆ H.
8. Thus H is ∇-maximal
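To illustrate the construction on a toy example, take ∇ to be the family of satisfiable sets of
propositional formulae (which is easily checked to be an abstract consistency class), Φ:={P ∨ Q},
and a hypothetical enumeration that begins with ¬P , ¬Q, Q, . . . Then H1 = {P ∨ Q, ¬P } (satisfiable,
e.g. by P ↦ F and Q ↦ T), H2 = H1 (since H1 ∗¬Q is unsatisfiable), H3 = {P ∨ Q, ¬P , Q}, and so
on; in the limit, H contains for every formula either it or its negation, namely whichever is true
under the assignment determined by these choices.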
Note that the construction in the proof above is non-trivial in two respects. First, the limit
construction for H is not executed in our original abstract consistency class ∇, but in a suitably
extended one to make it compact — the original would not have contained H in general. Second,
the set H is not unique for Φ, but depends on the choice of the enumeration of wff0 (V0 ). If we pick a
different enumeration, we will end up with a different H. Say if A and ¬A are both ∇-consistent
with Φ, then depending on which one comes first in the enumeration, H will contain that one, with
all the consequences for subsequent choices in the construction process.
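For instance, with ∇ and Φ as in the illustration above, an enumeration that begins with P instead
of ¬P yields a Hintikka set that contains P (and hence, by Hc , not ¬P ), so the two enumerations
lead to incompatible sets H.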
Valuation
Definition A.1.17. A function ν : wff0 (V0 )→Do is called a valuation, iff
ν(¬A) = T, iff ν(A) = F
ν(A ∧ B) = T, iff ν(A) = T and ν(B) = T
Lemma A.1.18. If ν : wff0 (V0 )→Do is a valuation and Φ ⊆ wff0 (V0 ) with ν(Φ) =
{T}, then Φ is satisfiable.
Proof sketch: ν|V0 : V0 →Do is a satisfying variable assignment.
Now, we only have to put the pieces together to obtain the model existence theorem we are after.
Model Existence
Lemma A.1.20 (Hintikka-Lemma). If ∇ is an abstract consistency class and H
a ∇-Hintikka set, then H is satisfiable.
Proof:
1. We define ν(A):=T, iff A∈H
2. then ν is a valuation by the Hintikka properties
3. and thus ν|V0 is a satisfying assignment.
Theorem A.1.21 (Model Existence). If ∇ is an abstract consistency class and
Φ∈∇, then Φ is satisfiable.
Proof:
1. There is a ∇-Hintikka set H with Φ ⊆ H (Extension Theorem)
2. We know that H is satisfiable. (Hintikka-Lemma)
3. In particular, Φ ⊆ H is satisfiable.
Observation: If we look at the completeness proof below, we see that the lemma above is the
only place where we had to deal with specific properties of the calculus T0 .
So if we want to prove completeness of any other calculus with respect to propositional logic,
then we only need to prove an analogue of this lemma and can use the rest of the machinery we
have already established “off the shelf”.
This is one great advantage of the “abstract consistency method”; the other is that the method
can be extended transparently to other logics.
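In outline, such a completeness proof for a refutation calculus C (like T0 ) runs as follows: one shows
that ∇:={Φ | Φ has no closed C-refutation} is an abstract consistency class; this is the only step
that needs properties of C. If Φ is unsatisfiable, the model existence theorem then gives Φ∉∇ by
contraposition, i.e. Φ has a closed C-refutation, so C is refutation complete.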
Completeness of T0
We will now analyze the first-order calculi for completeness. Just as in the case of the propositional
calculi, we prove a model existence theorem for the first-order model theory and then use that
for the completeness proofs. The proof of the first-order model existence theorem is completely
analogous to the propositional one; indeed, apart from the model construction itself, it is just an
extension by a treatment for the first-order quantifiers.
Appendix B
Completeness of Calculi for First-Order Logic
The proof of the model existence theorem goes via the notion of a Hintikka set, a set of
formulae with very strong syntactic closure properties, which allow us to read off models. Jaakko
Hintikka’s original idea for completeness proofs was that for every complete calculus C and every
C-consistent set one can induce a Hintikka set, from which a model can be constructed. This can
be considered as a first model existence theorem. However, the process of obtaining a Hintikka set
for a C-consistent set Φ of sentences usually involves complicated calculus dependent constructions.
In this situation, Raymond Smullyan was able to formulate the sufficient conditions for the
existence of Hintikka sets in the form of “abstract consistency properties” by isolating the calculus
independent parts of the Hintikka set construction. His technique allows us to reformulate Hintikka
sets as maximal elements of abstract consistency classes and interpret the Hintikka set construction
as a maximizing limit process.
To carry out the “model-existence”/“abstract consistency” method, we will first have to look at
the notion of consistency.
Consistency and refutability are very important notions when studying the completeness for
calculi; they form syntactic counterparts of satisfiability.
Consistency
Let C be a calculus,. . .
Definition B.1.1. Let C be a calculus, then a formula set Φ is called C-refutable, if we can
derive a contradiction from it.
Definition B.1.2. We call a pair of formulae A and ¬A a contradiction.
It is very important to distinguish the syntactic C-refutability and C-consistency from satisfiability,
which is a property of formulae that is at the heart of semantics. Note that the former have the
calculus (a syntactic device) as a parameter, while the latter does not. In fact we should actually
say S-satisfiability, where S := ⟨L, K, |=⟩ is the current logical system.
Even the word “contradiction” has a syntactic flavor to it; it translates to “saying against
each other” from its Latin root.
The notion of an “abstract consistency class” provides a calculus-independent notion of con-
sistency: A set Φ of sentences is considered “consistent in an abstract sense”, iff it is a member of
an abstract consistency class ∇.
Abstract Consistency
Definition B.1.6. Let ∇ be a family of sets. We call ∇ closed under subsets, iff
for each Φ∈∇, all subsets Ψ ⊆ Φ are elements of ∇.
Notation: We will use Φ∗A for Φ ∪ {A}.
The conditions are very natural: take for instance ∇c ; it would be foolish to call a set Φ of
sentences “consistent under a complete calculus”, if it contains an elementary contradiction. The
next condition ∇¬ says that if a set Φ that contains a sentence ¬¬A is “consistent”, then we should
be able to extend it by A without losing this property; in other words, a complete calculus should
be able to recognize A and ¬¬A to be equivalent. We will carry out the proof here, since it
gives us practice in dealing with the abstract consistency properties.
The main result here is that abstract consistency classes can be extended to compact ones. The
proof is quite tedious, but relatively straightforward. It allows us to assume that all abstract
consistency classes are compact in the first place (otherwise we pass to the compact extension).
Actually we are after abstract consistency classes that have an even stronger property than just
being closed under subsets. This will allow us to carry out a limit construction in the Hintikka
set extension argument later.
Compact Collections
Definition B.1.8. We call a collection ∇ of sets compact, iff for any set Φ we
have
Φ∈∇, iff Ψ∈∇ for every finite subset Ψ of Φ.
The property of being closed under subsets is a “downwards-oriented” property: we go from large
sets to small sets. Compactness (in the interesting direction) is an “upwards-oriented”
property: we can go from small (finite) sets to large (infinite) sets. The main application for the
compactness condition will be to show that infinite sets of formulae are in a family ∇ by testing
all their finite subsets (which is much simpler).
Hintikka sets are sets of sentences with very strong analytic closure conditions. These are motivated
as maximally consistent sets, i.e. sets that already contain everything that can be consistently
added to them.
∇-Hintikka Set
Definition B.1.11. Let ∇ be an abstract consistency class, then we call a set
H∈∇ a ∇-Hintikka set, iff H is maximal in ∇, i.e. for all A with H∗A∈∇ we
already have A∈H.
Theorem B.1.12 (Hintikka Properties). Let ∇ be an abstract consistency class
and H be a ∇-Hintikka set, then H cannot contain both a formula and its negation, if ¬¬A∈H,
then A∈H, and H satisfies analogous closure properties for the other connectives and the quantifiers.
The following theorem is one of the main results in the “abstract consistency”/”model existence”
method. For any set Φ in an abstract consistency class ∇ it allows us to construct a ∇-Hintikka set H with Φ ⊆ H.
Extension Theorem
Theorem B.1.13. If ∇ is an abstract consistency class and Φ∈∇ finite, then there
is a ∇-Hintikka set H with Φ ⊆ H.
Proof:
1. Wlog. we assume that ∇ is compact (otherwise we pass to the compact extension).
2. Choose an enumeration A1 , . . . of cwffo (Σι ) and an enumeration c1 , . . . of Σsk0 .
3. and construct a sequence of sets Hi with H0 :=Φ and
Hn+1 := Hn , if Hn ∗An ∉ ∇,
Hn+1 := Hn ∪ {An , ¬([cn /X](B))}, if Hn ∗An ∈ ∇ and An = ¬(∀X B),
Hn+1 := Hn ∗An otherwise.
4. Note that all Hi ∈∇; choose H := ⋃i∈N Hi .
5. Ψ ⊆ H finite implies there is a j∈N such that Ψ ⊆ Hj ,
6. so Ψ∈∇ as ∇ closed under subsets and H∈∇ as ∇ is compact.
7. Let H∗B∈∇, then there is a j∈N with B = Aj . Since Hj ∗Aj ⊆ H∗B and ∇ is closed
under subsets, we have Hj ∗Aj ∈∇, so B∈Hj+1 ⊆ H.
8. Thus H is ∇-maximal
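For illustration: if, say, An = ¬(∀X p(X)) and Hn ∗An ∈∇, then the construction adds both
¬(∀X p(X)) and ¬(p(cn )) to Hn+1 , where cn is the next unused witness constant from Σsk0 ; in this
way H eventually contains a concrete counterexample instance for every universal formula it refutes.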
Note that the construction in the proof above is non-trivial in two respects. First, the limit
construction for H is not executed in our original abstract consistency class ∇, but in a suitably
extended one to make it compact — the original would not have contained H in general. Second,
the set H is not unique for Φ, but depends on the choice of the enumeration of cwff o (Σι ). If
we pick a different enumeration, we will end up with a different H. Say if A and ¬A are both
∇-consistent with Φ, then depending on which one comes first in the enumeration, H will contain
that one, with all the consequences for subsequent choices in the construction process.
Valuations
Definition B.1.14. A function µ : cwff o (Σι )→Do is called a (first-order) valuation,
iff
µ(¬A) = T, iff µ(A) = F
µ(A ∧ B) = T, iff µ(A) = T and µ(B) = T
µ(∀X A) = T, iff µ([B/X](A)) = T for all closed terms B.
Lemma B.1.15. If φ : Vι →D is a variable assignment, then I φ : cwff o (Σι )→Do
is a valuation.
Thus a valuation is a weaker notion of evaluation in first-order logic; the other direction is also
true, even though the proof of this result is much more involved: The existence of a first-order
valuation that makes a set of sentences true entails the existence of a model that satisfies it.
3.6. A = ∀X B
3.6.1. then I φ (A) = T, iff I ψ (B) = µ(ψ(B)) = T, for all C∈Dι , where
ψ = φ,[C/X]. This is the case, iff µ(φ(A)) = T.
4. Thus I φ (A) = µ(φ(A)) = µ(A) = T for all A∈Φ.
5. Hence M|=A for M:=⟨Dι , I⟩.
Now, we only have to put the pieces together to obtain the model existence theorem we are after.
Model Existence
Theorem B.1.17 (Hintikka-Lemma). If ∇ is an abstract consistency class and
H a ∇-Hintikka set, then H is satisfiable.
Proof:
1. we define µ(A):=T, iff A∈H,
2. then µ is a valuation by the Hintikka set properties.
3. We have µ(H) = {T}, so H is satisfiable.
Theorem B.1.18 (Model Existence). If ∇ is an abstract consistency class and
Φ∈∇, then Φ is satisfiable.
Proof:
1. There is a ∇-Hintikka set H with Φ ⊆ H (Extension Theorem)
2. We know that H is satisfiable. (Hintikka-Lemma)
3. In particular, Φ ⊆ H is satisfiable.
This directly yields two important results that we will use for the completeness analysis.
Henkin’s Theorem
Corollary B.2.2 (Henkin’s Theorem). Every ND1 -consistent set of sentences
has a model.
Proof:
1. Let Φ be a ND1 -consistent set of sentences.
2. The class of ND1 -consistent sets of sentences constitutes an abstract consis-
tency class.
3. Thus the model existence theorem guarantees a model for Φ.
Corollary B.2.3 (Löwenheim-Skolem Theorem). Every satisfiable set Φ of first-order
sentences has a countable model.
Proof sketch: The model we constructed is countable, since the set of ground terms
is.
Now, the completeness result for first-order natural deduction is just a simple argument away.
We also get a compactness theorem (almost) for free: logical systems with a complete calculus are
always compact.
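The argument for the latter is, in outline: suppose every finite subset of Φ is satisfiable. Then no
finite subset of Φ is ND1 -refutable (by soundness), and since any ND1 -derivation uses only finitely
many premises from Φ, the set Φ itself is ND1 -consistent; by Henkin’s theorem it therefore has a
model, so Φ is satisfiable.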
Soundness of T1f
Lemma B.3.1. Tableau rules transform satisfiable tableaux into satisfiable ones.
Proof:
We examine the tableau rules in turn:
1. propositional rules as in propositional tableaux
2. T1f ∃ by ??
3. T1f⊥ by ?? (substitution value lemma)
4. T1f ∀
4.1. I φ (∀X A) = T, iff I φ,[a/X] (A) = T for all a∈Dι
4.2. so in particular for some a∈Dι (as Dι is nonempty).
Corollary B.3.2. T1f is correct.
The only interesting steps are the cut rule, which can be directly handled by the substitution
value lemma, and the rule for the existential quantifier, which we do in a separate lemma.
Soundness of T1f ∃
This proof is paradigmatic for soundness proofs for calculi with Skolemization. We use the axiom
of choice at the meta-level to choose a meaning for the Skolem function symbol.
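In outline (a sketch, the details depend on the precise formulation of the rule): suppose the current
branch is satisfiable in a model M, so I φ (∀X A) = F under a satisfying assignment φ. For each
choice of values for the free variables X 1 , . . ., X k of A there is then some a with I φ,[a/X] (A) = F;
using the axiom of choice we pick one such witness for every tuple of values and interpret the new
Skolem function symbol f by the resulting function. The model extended in this way still satisfies
the branch and additionally ([f (X 1 , . . ., X k )/X](A))F , so the rule preserves satisfiability.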
Armed with the Model Existence Theorem for first-order logic (Theorem B.1.18), the com-
pleteness of first-order tableaux is similarly straightforward. We just have to show that the col-
lection of tableau-irrefutable sentences is an abstract consistency class, which is a simple proof-
transformation exercise in all but the universal quantifier case, which we postpone to its own
Lemma (Theorem B.3.5).
Completeness of T1f
[Two tableaux side by side, both starting with ΨT and (∀X A)F : the left one continues with
([c/X](A))F and Rest; the right one continues with ([f (X 1 , . . ., X k )/X](A))F and
[f (X 1 , . . ., X k )/c](Rest), i.e. the new constant c is replaced by the Skolem term
f (X 1 , . . ., X k ) throughout.]
So we only have to treat the case for the universal quantifier. This is what we usually call a
“lifting argument”, since we have to transform (“lift”) a proof for a formula θ(A) to one for A. In
the case of tableaux we do that by an induction on the tableau refutation for θ(A) which creates
a tableau-isomorphism to a tableau refutation for A.
Tableau-Lifting
Theorem B.3.5. If Tθ is a closed tableau for a set θ(Φ) of formulae, then there is
a closed tableau T for Φ.
Again, the “lifting lemma for tableaux” is paradigmatic for lifting lemmata for other refutation
calculi.
Correctness (CNF)
Lemma B.4.1. A set Φ of sentences is satisfiable, iff CNF1 (Φ) is.
Proof: The propositional rules and the ∀-rule are trivial; we only treat the ∃-rule.
1. Let (∀X A)F be satisfiable in M:=⟨D, I⟩ and free(A) = {X 1 , . . ., X n }.
2. I φ (∀X A) = F, so there is an a∈D with I φ,[a/X] (A) = F (this only depends on
φ|free(A) ).
3. Let g : Dn →D be defined by g(a1 , . . ., an ):=a, iff φ(X i ) = ai .
4. Choose M′ :=⟨D, I ′ ⟩ with I ′ (f ):=g, then I ′ φ ([f (X 1 , . . ., X n )/X](A)) = F.
5. Thus ([f (X 1 , . . ., X n )/X](A))F is satisfiable in M′ .
Resolution (Correctness)
Definition B.4.2. A clause is called satisfiable, iff I φ (A) = α for one of its literals
Aα .
Lemma B.4.3. The empty clause □ is unsatisfiable.
Lemma B.4.4. CNF transformations preserve satisfiability (see above)
Completeness (R1 )
Theorem B.4.6. R1 is refutation complete.
Proof: ∇:={Φ | ΦT has no closed tableau} is an abstract consistency class
1. as for the propositional case.
2. by the lifting lemma below
3. Let T be a closed tableau for ¬(∀X A)∈Φ and ΦT ∗([c/X](A))F ∈∇.
4. CNF1 (ΦT ) = CNF1 (ΨT ) ∪ CNF1 (([f (X 1 , . . ., X k )/X](A))F )
5. ([f (X 1 , . . ., X k )/c](CNF1 (ΦT )))∗([c/X](A))F = CNF1 (ΦT )
6. so R1 : CNF1 (ΦT )⊢D′ □, where D′ = [f (X1′ , . . ., Xk′ )/c](D).
Definition B.4.8. Let Φ and Ψ be clause sets, then we call a bijection Ω : Φ→Ψ
a clause set isomorphism, iff there is a clause isomorphism ω : C→Ω(C) for each
C∈Φ.
Lemma B.4.9. If θ(Φ) is a set of formulae, then there is a θ-compatible clause set
isomorphism Ω : CNF1 (Φ)→CNF1 (θ(Φ)).
Lifting for R1
Theorem B.4.10. If R1 : θ(Φ)⊢Dθ □ for a set θ(Φ) of formulae, then there is an
R1 -refutation for Φ.
Proof: by induction over Dθ we construct an R1 -derivation R1 : Φ⊢D C and a θ-
compatible clause set isomorphism Ω : D→Dθ
1. If Dθ ends in a resolution step res with premises (θ(A))T ∨ (θ(C)) (derived by Dθ′ ) and
(θ(B))F ∨ (θ(D)) (derived by Dθ′′ ) and resolvent (σ(θ(C))) ∨ (σ(θ(D))),
then we have (IH) clause isomorphisms ω ′ : AT ∨ C→(θ(A))T ∨ (θ(C)) and
ω ′′ : BF ∨ D→(θ(B))F ∨ (θ(D)),
2. thus we can resolve AT ∨ C and BF ∨ D with Res to (ρ(C)) ∨ (ρ(D)), where ρ = mgu(A, B)
(which exists, as σ ◦ θ is a unifier of A and B).