
Algorithms for Validation

Mykel J. Kochenderfer
Sydney M. Katz
Anthony L. Corso
Robert J. Moss

Stanford, California
© 2024 Kochenderfer, Katz, Corso, and Moss

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means
(including photocopying, recording or information storage and retrieval) without permission in writing from
the publisher.

This book was set in TeX Gyre Pagella by the authors in LaTeX.
Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data is available.

ISBN:

10 9 8 7 6 5 4 3 2 1
To our families.
Contents

Preface
1 Introduction
1.1 Validation
1.2 History
1.3 Societal Consequences
1.4 Validation Algorithms
1.5 Challenges
1.6 Overview
2 System Modeling
2.1 Coming Soon
3 Property Specification
3.1 Properties of Systems
3.2 Metrics for Stochastic Systems
3.3 Composite Metrics
3.4 Logical Specifications
3.5 Temporal Logic
3.6 Reachability Specifications
3.7 Summary
4 Falsification through Optimization
4.1 Direct Sampling
4.2 Disturbances
4.3 Fuzzing
4.4 Falsification through Optimization
4.5 Objective Functions
4.6 Optimization Algorithms
4.7 Summary
5 Falsification through Planning
5.1 Shooting Methods
5.2 Tree Search
5.3 Heuristic Search
5.4 Monte Carlo Tree Search
5.5 Reinforcement Learning
5.6 Simulator Requirements
5.7 Summary
6 Failure Distribution
6.1 Distribution over Failures
6.2 Rejection Sampling
6.3 Markov Chain Monte Carlo
6.4 Probabilistic Programming
6.5 Summary
7 Failure Probability Estimation
7.1 Direct Estimation
7.2 Importance Sampling
7.3 Adaptive Importance Sampling
7.4 Sequential Monte Carlo
7.5 Ratio of Normalizing Constants
7.6 Multilevel Splitting
7.7 Summary
7.8 Exercises
8 Reachability for Linear Systems
8.1 Forward Reachability
8.2 Set Propagation Techniques
8.3 Set Representations
8.4 Reducing Computational Cost
8.5 Linear Programming
8.6 Summary
9 Reachability for Nonlinear Systems
9.1 Interval Arithmetic
9.2 Inclusion Functions
9.3 Taylor Models
9.4 Concrete Reachability
9.5 Optimization-Based Nonlinear Reachability
9.6 Partitioning
9.7 Neural Networks
9.8 Summary
10 Reachability for Discrete Systems
10.1 Graph Formulation
10.2 Reachable Sets
10.3 Satisfiability
10.4 Probabilistic Reachability
10.5 Discrete State Abstractions
10.6 Summary
11 Runtime Monitoring
11.1 Coming Soon
12 Explainability
12.1 Explanations
12.2 Policy Visualization
12.3 Feature Importance
12.4 Policy Explanation through Surrogate Models
12.5 Counterfactual Explanations
12.6 Failure Mode Characterization
12.7 Summary
A Problems
A.1 Coming Soon
B Mathematical Concepts
B.1 Coming Soon
C Optimization
C.1 Coming Soon
D Neural Networks
D.1 Coming Soon
E Julia
E.1 Coming Soon
References
Preface

This book provides a broad introduction to algorithms for validating safety-critical systems. We cover a wide variety of topics related to validation, introducing the
underlying mathematical problem formulations and the algorithms for solving
them. Figures, examples, and exercises are provided to convey the intuition behind
the various approaches.
This book is intended for advanced undergraduates and graduate students, as
well as professionals. It requires some mathematical maturity and assumes prior
exposure to multivariable calculus, linear algebra, and probability concepts. Some
review material is provided in the appendices. Disciplines where the book would
be especially useful include mathematics, statistics, computer science, aerospace,
electrical engineering, and operations research.
Fundamental to this textbook are the algorithms, which are all implemented
in the Julia programming language. We have found this language to be ideal for
specifying algorithms in human-readable form. The priority in the design of the
algorithmic implementations was interpretability rather than efficiency. Industrial applications, for example, may benefit from alternative implementations.
Permission is granted, free of charge, to use the code snippets associated with
this book, subject to the condition that the source of the code is acknowledged.

Mykel J. Kochenderfer
Sydney M. Katz
Anthony L. Corso
Robert J. Moss
Stanford, California
August 28, 2024
1 Introduction

Before deploying decision-making systems in high-stakes settings, it is important to ensure that they will operate as intended. We refer to the process of analyzing
the behavior of these systems as validation. Validation is a critical component of
the development process for decision-making systems in a variety of domains
including autonomous vehicles, robotics, and healthcare. As these systems and
their operating environments increase in complexity, understanding the full
spectrum of possible behaviors becomes more challenging and requires a rigorous
validation process. This book discusses these challenges and presents a variety of
computational methods for validating autonomous systems. This chapter begins
with a broad overview of validation. We motivate the need for validation from a
historical perspective and outline the societal consequences of validation failures.
We then introduce the validation framework that we will use throughout the
book. We discuss the challenges associated with validation and conclude with an
overview of the remaining chapters in the book.

1.1 Validation

The concept of validation is defined differently by different communities, and the word itself is often used in conjunction with other terms such as verification and testing.[1] In this book, we define validation as the broad process of establishing confidence that a system will behave as desired when deployed in the real world. We define verification as a special type of validation that provides guarantees about the correctness of a system with respect to a specification. We define testing as a technique used for validation that involves evaluating the system on a discrete set of test cases.

[1] For a discussion on these definitions, see section 1.2.3 of A. Engel, Verification, Validation, and Testing of Engineered Systems. John Wiley & Sons, 2010, vol. 73.

From a systems engineering perspective, validation is viewed as a phase of the development cycle for autonomous systems (figure 1.1). A typical development cycle begins by defining a set of operational requirements for the system. For example, the developers of an aircraft collision avoidance system may identify a requirement on the probability of collision when deployed in the airspace. Designers then use these requirements to produce an initial version of the system. In the aircraft collision avoidance example, the system may consist of a decision-making agent that selects actions to avoid collisions based on sensor information. A common technique for designing a system to match a set of desired requirements is to optimize the system with respect to an objective function or reward model that captures the requirements. However, the models used to perform the optimization may be imperfect, the optimization objective may not perfectly capture the requirements, and the optimization process itself is often approximate. This misalignment can result in a mismatch between the desired behavior of the system and its actual behavior when deployed in the real world. We refer to this phenomenon as the alignment problem.[2]

(Figure 1.1. A typical development cycle for an autonomous system: define requirements, design, validate. This book focuses on the validation phase of development.)

[2] A detailed discussion of the alignment problem is provided in B. Christian, The Alignment Problem: Machine Learning and Human Values. W. W. Norton & Company, 2020.

The alignment problem motivates the need for the validation phase of the development cycle. Given the requirements and design, validation algorithms analyze whether the system will behave as intended when deployed in its operating environment. Based on the results of the validation process, developers may need to revise the design or requirements. This process is often repeated multiple times before the system is ready for deployment. It is important to perform validation early in the development cycle to detect bugs and misalignments before they become more costly to fix. For example, repairing a software bug during maintenance is often orders of magnitude more expensive than fixing the bug early in the development cycle.[3]

[3] C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008, ch. 1.

This book focuses entirely on the validation phase of the development cycle. We assume that we are given a system that has been designed to meet a set of established requirements, and we discuss methods to translate the system and its requirements to computational models and formal specifications that allow us to apply a variety of validation algorithms. In other words, this book is not about systems engineering or the development of systems.[4] Instead, we focus on algorithms that validate the behavior of these systems in their operating environments.

[4] More information about the systems engineering process can be found in A. Kossiakoff, S. M. Biemer, S. J. Seymour, and D. A. Flanigan, Systems Engineering Principles and Practice. John Wiley & Sons, 2020. A variety of algorithms for designing decision-making systems are provided in M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.

Validation techniques have been developed for a wide variety of systems ranging from aircraft parts to medical devices to customer service chatbots. For example, aircraft designers validate the structural integrity of the wings through
extensive stress testing, and medical device manufacturers validate the safety
of their devices through clinical trials. In this book, we present an algorithmic
perspective on validation and focus specifically on the validation of decision-making agents.
Decision-making agents interact with the environment and make decisions based on the information they receive. These agents range from fully automated systems that operate independently within their environment to decision-support systems that inform human decision-makers.[5] Examples include aircraft collision avoidance systems, adaptive cruise control systems, hiring assistants, disaster response systems, and other cyberphysical systems.[6] While the algorithms presented in this book can be applied to many different types of decision-making agents, we place a particular emphasis on sequential decision-making agents, which make a series of decisions over time. For example, an autonomous vehicle must make a sequence of decisions to navigate from one location to another.

[5] Autonomy and automation have different definitions in different communities. Autonomy is often defined as the automation of high-level tasks such as driving. The algorithms in this book can be applied to decision-making systems with any level of automation or autonomy.
[6] Cyberphysical systems are computational systems that interact with the physical world.

1.2 History

The history of validation is deeply intertwined with the evolution of complex systems across many domains. Early forms of validation can be traced back to the ideas of ancient Greek philosophers such as Aristotle (384–322 BC).[7] Aristotle advocated for a continuous cycle of observation and experimentation to validate hypotheses. The scientific method introduced during the scientific revolution of the 16th and 17th centuries formalized this notion. During this time, Francis Bacon (1561–1626) proposed a method for validating scientific hypotheses through empirical observation and experimentation.

[7] W. M. Dickie, “A Comparison of the Scientific Method and Achievement of Aristotle and Bacon,” The Philosophical Review, vol. 31, no. 5, pp. 471–494, 1922.

The technological changes brought on by the industrial revolutions accelerated progress in validation. During the First Industrial Revolution of the late 18th and early 19th centuries, the complexity of systems increased dramatically, and the field of validation shifted from validating ideas and hypotheses to validating machines and production processes. The increase of mass production in factories during the Second Industrial Revolution (1870–1914) further motivated the need for validation. Supervisors began to perform quality control checks on products to ensure that they met the desired specifications.[8] As production volume increased in the following years, supervisors could no longer inspect every product, and factories began to hire designated inspectors for quality control.

[8] K. Ishikawa and J. H. Loftus, Introduction to Quality Control. Springer, 1990, vol. 98, ch. 1.
(Figure 1.2. Comparison of the waterfall and V models of the software development lifecycle.)

During World War II, production volume increased to the point where it was no longer possible to inspect every product. This increase in production output led to the adoption of statistical quality control methods, which relied on sampling to speed up inspection. These ideas were developed by W. Edwards Deming[9] (1900–1993) and Joseph M. Juran[10] (1904–2008) and marked the beginning of the field of statistical process control. Deming and Juran introduced these ideas to Japanese manufacturers after World War II, which played a key role in the post-war economic recovery of Japan.

[9] W. M. Tsutsui, “W. Edwards Deming and the Origins of Quality Control in Japan,” Journal of Japanese Studies, vol. 22, no. 2, pp. 295–325, 1996.
[10] D. Phillips-Donaldson, “100 Years of Juran,” Quality Progress, vol. 37, no. 5, pp. 25–31, 2004.

The advancements in computing technology in the latter half of the 20th century increased our ability to use statistical methods to validate complex systems. In the late 1940s, scientists at Los Alamos National Laboratory developed the Monte Carlo method, which uses random sampling to solve complex mathematical problems.[11] These methods were later used to validate complex systems in a variety of domains such as aviation and finance. Progress in computing technology also led to new challenges in validation. The development of software systems required new validation techniques and best practices to ensure that the software operated correctly.

[11] A. F. Bielajew, “History of Monte Carlo,” in Monte Carlo Techniques in Radiation Therapy, CRC Press, 2021, pp. 3–15.

In the 1970s, software engineers began formalizing the software development life cycle into phases that supported rigorous testing and validation. The waterfall model of software development, introduced in 1970, divided the software development process into distinct phases including requirements, design, implementation, testing, and maintenance.[12] In the 1990s, the waterfall model was refined into the V model, which emphasizes the importance of testing and validation throughout the software development process.[13] The V model aligns testing and validation activities with the corresponding development activities, ensuring that the system is validated at each stage of development. Figure 1.2 compares the waterfall and V models of the software development life cycle.

[12] W. W. Royce, “Managing the Development of Large Software Systems: Concepts and Techniques,” IEEE WESCON, 1970.
[13] K. Forsberg and H. Mooz, “The Relationship of System Engineering to the Project Cycle,” Center for Systems Management, vol. 5333, 1991.

The 20th century also saw the emergence of regulatory bodies to guide the safe development of new technologies. The Food and Drug Administration (FDA) was established in the United States in 1906 after a series of food and drug safety incidents.[14] In 1947, the International Organization for Standardization (ISO) was founded to develop international standards for products and services.[15] After a series of midair collisions between aircraft, the Federal Aviation Administration (FAA) was formed in 1958 to regulate civil aviation in the United States.[16]

[14] A. T. Borchers, F. Hagie, C. L. Keen, and M. E. Gershwin, “The History and Contemporary Challenges of the US Food and Drug Administration,” Clinical Therapeutics, vol. 29, no. 1, pp. 1–16, 2007.
[15] C. N. Murphy and J. Yates, The International Organization for Standardization (ISO): Global Governance Through Voluntary Consensus. Routledge, 2009.
[16] J. W. Gelder, “Air Law: The Federal Aviation Act of 1958,” Michigan Law Review, vol. 57, no. 8, pp. 1214–1227, 1959.

As technology matured in the late 20th and early 21st centuries, these regulatory bodies introduced new standards and requirements. For example, the Radio Technical Commission for Aeronautics (RTCA) introduced the DO-178 standard in 1982 to provide guidelines for the development of safety-critical software in aviation. DO-178 has been updated multiple times in the following years to account for new technological advancements in the field and has been used frequently by the FAA to certify the safety of aircraft software.[17] In 2011, ISO 26262 was introduced as an international standard relating to the functional safety of automotive systems. While ISO 26262 was developed specifically for electronic/electric systems in road vehicles, many researchers have used it as a guideline for the development of both hardware and software for autonomous vehicles.[18]

[17] More information on the history of software standards in aviation can be found in L. Rierson, Developing Safety-Critical Software: A Practical Guide for Aviation Software and DO-178C Compliance. CRC Press, 2017.
[18] M. A. Gosavi, B. B. Rhoades, and J. M. Conrad, “Application of Functional Safety in Autonomous Vehicles Using ISO 26262 Standard: A Survey,” in SoutheastCon, 2018.

Starting in the 2010s, artificial intelligence (AI) and machine learning systems became increasingly prevalent in a variety of applications. For example, AI systems were introduced into autonomous vehicles, aircraft, medical diagnosis, and financial trading. The increased capabilities and applications of AI led to new validation challenges and techniques. Not only are the systems themselves complex, but they also operate in complex environments, making validation of these systems particularly challenging. In 2020, the European Union Aviation Safety Agency (EASA) published initial guidelines related to the design assurance of neural networks, which are a key component of many machine learning systems.[19] In that document, they outline a modification of the traditional V model to account for validation of the learning process. In general, the validation of AI systems is still an active area of research.

[19] EASA AI Task Force, “Concepts of Design Assurance for Neural Networks,” EASA, 2020.

1.3 Societal Consequences

The validation of decision-making agents is critical in ensuring that these systems are properly integrated into society. Failures in validation can have severe societal consequences. This section discusses the impacts of validation on various aspects of society.

1.3.1 Safety
Validation is necessary for ensuring the safety of systems that interact with the physical world. Failures of safety-critical systems can result in catastrophic accidents that cause injury or loss of life. For example, unintended behavior of the safety-critical software used by the Therac-25 radiation therapy machine caused radiation overdoses that resulted in death or serious injury to six patients.[20] Safety is also important for transportation systems such as aircraft and cars. In 2002, a mid-air collision over Überlingen, Germany resulted in 71 fatalities when the traffic alert and collision avoidance system (TCAS) and air traffic control (ATC) systems issued conflicting instructions to the pilots.[21] Furthermore, it is important to ensure that autonomous vehicles make safe decisions in a wide range of scenarios to prevent potential accidents. Since their introduction, autonomous vehicles have been involved in accidents that have resulted in injuries or fatalities.[22]

[20] N. G. Leveson and C. S. Turner, “An Investigation of the Therac-25 Accidents,” Computer, vol. 26, no. 7, pp. 18–41, 1993.
[21] J. Kuchar and A. C. Drumm, “The Traffic Alert and Collision Avoidance System,” Lincoln Laboratory Journal, vol. 16, no. 2, p. 277, 2007.
[22] R. L. McCarthy, “Autonomous Vehicle Accident Data Analysis: California OL 316 Reports: 2015–2020,” ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part B: Mechanical Engineering, vol. 8, no. 3, p. 034502, 2022.

1.3.2 Fairness

When agents make decisions that affect the lives of large groups of people, we must ensure that their decisions are fair and unbiased. Validation helps researchers and organizations identify and correct biases in decision-making systems before deployment. If these biases are not addressed, they can have serious consequences for individuals and society as a whole. For example, an automated hiring system developed by Amazon was ultimately discontinued after it was found to be biased against women due to biases in the historical data it was trained on.[23] In another case, a software system designed to predict recidivism rates in criminal defendants called COMPAS was found to be biased toward certain demographics based on empirical data.[24] Using the outputs of these systems to make decisions can result in the unfair treatment of individuals. Validating these systems before deployment can help prevent this type of failure.

[23] A. L. Hunkenschroer and A. Kriebitz, “Is AI Recruiting (Un)ethical? A Human Rights Perspective on the Use of AI for Hiring,” AI and Ethics, vol. 3, no. 1, pp. 199–213, 2023.
[24] Other research has argued that the system is fair under a different definition of fairness. A detailed discussion is provided in J. Kleinberg, S. Mullainathan, and M. Raghavan, “Inherent Trade-Offs in the Fair Determination of Risk Scores,” Innovations in Theoretical Computer Science (ITCS) Conference, 2017.

1.3.3 Public Trust

Public trust in autonomous systems is critical for their widespread adoption, and validation plays a key role in developing this trust. For example, trust has been identified as a key factor in the eventual adoption of autonomous vehicles into
society.[25] For this reason, autonomous vehicle designers and manufacturers have invested heavily in validation to ensure that their vehicles are safe and reliable. The aviation industry is another example of an industry that relies on public trust. The industry has maintained public trust by upholding a rigorous safety process that has resulted in a strong safety record. However, failures in validation can erode public trust. For instance, when the Boeing 737 MAX 8 aircraft was grounded worldwide after two fatal crashes, public trust in the aviation industry was significantly impacted.[26] Validation also allows us to anticipate possible ethical dilemmas before deployment.[27] Addressing these dilemmas is crucial to maintaining trust.

[25] J. K. Choi and Y. G. Ji, “Investigating the Importance of Trust on Adopting an Autonomous Vehicle,” International Journal of Human-Computer Interaction, vol. 31, no. 10, pp. 692–702, 2015.
[26] J. Herkert, J. Borenstein, and K. Miller, “The Boeing 737 MAX: Lessons for Engineering Ethics,” Science and Engineering Ethics, vol. 26, pp. 2957–2974, 2020.
[27] An example of an ethical analysis for autonomous vehicles can be found in J. Siegel and G. Pappas, “Morals, Ethics, and the Technology Capabilities and Limitations of Automated and Self-Driving Vehicles,” AI & Society, vol. 38, no. 1, pp. 213–226, 2023.

1.3.4 Economics

Systems that operate expensive equipment or control finances require validation to decrease the risk of significant economic loss. In 1996, the maiden voyage of the Ariane 5 rocket ended in an explosion that could ultimately be traced back to a software bug caused by overflow when converting from a 64-bit to a 16-bit value.[28] The failure resulted in loss of the rocket and the research satellites it was carrying to space for a total of $370 million in damages. Furthermore, failures in financial decision-making systems can affect entire economic systems. For example, the failure of the Long-Term Capital Management (LTCM) hedge fund in 1998 nearly caused a global financial crisis and required a $3.6 billion bailout. The fund used a trading strategy that failed to account for extreme events.[29] When these events occurred, LTCM suffered massive losses.

[28] M. Dowson, “The Ariane 5 Software Failure,” Software Engineering Notes, vol. 22, no. 2, p. 84, 1997.
[29] P. Jorion, “Risk Management Lessons from Long-Term Capital Management,” European Financial Management, vol. 6, no. 3, pp. 277–300, 2000.

(Figure 1.3. Validation algorithms check whether a given system satisfies a specification. The system consists of an agent operating in an environment, which it perceives using a sensor or set of sensors.)

1.4 Validation Algorithms

Validation algorithms require two inputs, as shown in figure 1.3. The first input is
the system under test, which we will refer to as the system. The system represents
a decision-making agent operating in an environment. The agent makes decisions
based on information from the environment that it receives from sensors.[30] The second input is a specification, which expresses an operating requirement for the system. Specifications often pertain to safety, but they may also address other key design objectives. Given these inputs, validation algorithms output metrics to help us understand the scenarios in which the system does or does not satisfy the specification. The rest of this section provides a high-level overview of these inputs and outputs.

[30] Up to this point, we have informally used the term system to refer to only the agent and its sensors. For the remainder of the book, we will also include the operating environment as part of the system.

1.4.1 System
A system (algorithm 1.1) consists of three main components: an environment,
an agent, and a sensor. The environment represents the world in which the agent
operates. We refer to an agent’s configuration within its environment as its state s.
The state space S represents the set of all possible states. An environment consists
of an initial state distribution and a transition model. When the agent takes an
action, the state evolves probabilistically according to the transition model. The
transition model T(s′ | s, a) denotes the probability of transitioning to state s′
from state s when the agent takes action a.


Algorithm 1.1. A system consists of an agent, its operating environment, and the sensor or set of sensors that it uses to perceive its environment.

abstract type Agent end
abstract type Environment end
abstract type Sensor end

struct System
    agent::Agent
    env::Environment
    sensor::Sensor
end

For physical systems, the state often represents an agent's position and velocity in the environment, and the transition model is typically governed by the agent's equations of motion. Figure 1.4 shows an example of a state for an inverted pendulum system. The state and transition model may also contain information about other agents in the environment. For example, the environment for an aircraft collision avoidance system contains the other aircraft in the airspace that the agent must avoid. The other agents may also be human agents such as other drivers or pedestrians in the environment of an autonomous vehicle. The presence of other agents in the environment often increases our uncertainty in the outcome of a particular action.

(Figure 1.4. The state s of an inverted pendulum system can be compactly represented as its current angle from the vertical θ and its angular velocity ω, so that s = [θ, ω].)

In many real-world systems, agents do not have access to their true state within the environment and instead rely on observations from sensors. We define the sensor component of a system as a mechanism for sensing information about the environment. Many real-world systems rely on multiple sensors, so the sensor component may contain multiple sensing modalities. For example, an autonomous vehicle senses its position in the world using a combination of sensors such as global positioning systems (GPS), cameras, and LiDAR. We model the sensor component using an observation model O(o | s), which represents the probability of producing observation o in state s. Observations come in multiple forms based on the sensing modality. For example, GPS sensors output coordinates, while camera sensors output image data. We call the set of all possible observations for a system its observation space O.

An agent uses observations to select actions from a set of possible actions known as the action space A. Agents may use a number of decision-making algorithms or frameworks to select actions. While some agents select actions based entirely on the observation, other agents use the observation to first estimate the state and then select an action based on this estimate. Furthermore, some agents may keep track of previous actions and observations internally to improve their state estimate. For example, an aircraft that only observes its altitude may keep track of previous altitude measurements to estimate its climb or descent rate. We abstract these behaviors of the agent using the notion of a policy π, which is responsible for selecting an action given the current observation and information the agent has stored previously. An agent's policy can be stochastic or deterministic. A stochastic policy samples actions according to a probability distribution, while a deterministic policy will always produce the same action given the same information.

(Figure 1.5. A system consists of an agent with policy π, an environment governed by transition model T, and a sensor with observation model O.)
The transition model T(s′ | s, a) satisfies the Markov assumption, which requires that the next state depend only on the current state and action. The state space, action space, observation space, observation model, and transition model are all elements of a sequential decision-making framework known as a partially observable Markov decision process (POMDP).[31] Figure 1.5 demonstrates how these elements fit into the components of a system. Appendix A provides implementations of these components for the example systems discussed in this book.

[31] M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.
We analyze the behavior of a system over time by considering the sequence
of states, observations, and actions that the agent experiences. This sequence
is known as a trajectory. We generate trajectories by performing a rollout of the
system (algorithm 1.2). A rollout begins by sampling an initial state from the
initial state distribution associated with the environment. At each time step, the
sensor produces an observation based on the current state, the agent selects an

action based on the observation, and the environment transitions to a new state based on the action. We repeat this process to a desired depth d to generate a trajectory τ = (s_1, o_1, a_1, …, s_d, o_d, a_d) where s_{i+1} ∼ T(· | s_i, a_i), o_i ∼ O(· | s_i), and a_i ∼ π(· | o_i). Figure 1.6 shows an example trajectory for the inverted pendulum system.

Algorithm 1.2. A function that performs a rollout of a system sys to a depth d and returns the resulting trajectory τ. It samples an initial state from the initial state distribution associated with the environment. It then repeatedly calls the step function, which steps the system forward in time. The step function takes in the current state s, produces an observation o from the sensor, gets the action a from the agent based on this observation, and determines the next state s′ from the environment.

function step(sys::System, s)
    o = sys.sensor(s)
    a = sys.agent(o)
    s′ = sys.env(s, a)
    return (; o, a, s′)
end

function rollout(sys::System; d)
    s = rand(Ps(sys.env))
    τ = []
    for t in 1:d
        o, a, s′ = step(sys, s)
        push!(τ, (; s, o, a))
        s = s′
    end
    return τ
end

(Figure 1.6. Example trajectory of depth d = 4 for the inverted pendulum system. At each time step, the sensor produces a noisy observation of the true state, and the agent tries to keep the pendulum upright by selecting a torque to apply at the base of the pendulum.)
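To make these interfaces concrete, the following sketch implements the three system components for a simplified inverted pendulum and runs a rollout. This is an illustrative sketch rather than the book's own implementation (the book's example systems appear in appendix A): the dynamics are simplified, the type names, gains, and noise levels are our assumptions, and Ps is assumed to return the initial state distribution used by rollout.

using Distributions: Normal, MvNormal
using LinearAlgebra: Diagonal

struct ProportionalAgent <: Agent
    k1::Float64  # gain on the observed angle
    k2::Float64  # gain on the observed angular velocity
end
(agent::ProportionalAgent)(o) = -agent.k1*o[1] - agent.k2*o[2]  # corrective torque

struct PendulumEnv <: Environment
    dt::Float64  # integration time step
end
function (env::PendulumEnv)(s, a)
    θ, ω = s
    ω′ = ω + (sin(θ) + a)*env.dt  # simplified inverted pendulum dynamics
    θ′ = θ + ω′*env.dt
    return [θ′, ω′]
end
Ps(env::PendulumEnv) = MvNormal(zeros(2), Diagonal([0.1, 0.1]))  # initial state distribution

struct AdditiveNoiseSensor <: Sensor
    σ::Float64  # observation noise standard deviation
end
(sensor::AdditiveNoiseSensor)(s) = s .+ rand(Normal(0.0, sensor.σ), length(s))

sys = System(ProportionalAgent(10.0, 2.0), PendulumEnv(0.1), AdditiveNoiseSensor(0.05))
τ = rollout(sys; d=4)  # depth-4 trajectory of (s, o, a) tuples, as in figure 1.6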

1.4.2 Specification
A specification ψ is a formal expression of a requirement that the system must
satisfy when deployed in the real world. These requirements may be derived from
domain knowledge or other systems engineering principles. Some industries
have regulatory agencies that govern requirements. These agencies are especially
common in safety-critical industries. For example, the FAA and the FDA in the
United States provide regulations and requirements for aircraft and healthcare
systems, respectively.
We express specifications by translating operating requirements to logical formulas that can be evaluated on trajectories.[32] For example, the specification for an aircraft collision avoidance system is that the agent should not collide with other aircraft in the airspace. Given a trajectory, we want to check whether any of the states in the trajectory represent a collision.

[32] Chapter 3 discusses this process in detail.
Algorithm 1.3 defines a general framework for specifications that we will use
throughout this book. Evaluating a specification on a trajectory results in a Boolean
value that indicates whether the specification is satisfied. We consider a trajectory
to be a failure if the specification is not satisfied. Example 1.1 demonstrates this
idea on a simple grid world system. We can also derive higher-level metrics from
specifications such as the probability of failure or the expected cost of failure.

Algorithm 1.3. Definition of a specification. We evaluate specifications on trajectories. We consider a trajectory to be a failure if the specification is not satisfied.

abstract type Specification end
function evaluate(ψ::Specification, τ) end
isfailure(ψ::Specification, τ) = !evaluate(ψ, τ)

Example 1.1 (example trajectories evaluated against a specification for the grid world system). In the grid world example, the agent's goal is to navigate to the green goal state while avoiding the red obstacle state. Therefore, given a trajectory, the specification ψ will be satisfied if the trajectory contains the goal state and does not contain the obstacle state. The green trajectory in the accompanying figure satisfies the specification, while the red trajectory represents a failure. Chapter 3 will discuss how to express this specification as a logical formula.
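As a concrete illustration of this interface, the following sketch encodes the reach-avoid requirement from example 1.1. It is a hedged example rather than the book's grid world code: it assumes that states can be compared with == and that each trajectory element stores its state in the s field, as produced by rollout.

struct ReachAvoidSpecification <: Specification
    goal      # state the trajectory must reach
    obstacle  # state the trajectory must avoid
end

function evaluate(ψ::ReachAvoidSpecification, τ)
    states = [step.s for step in τ]
    return ψ.goal in states && !(ψ.obstacle in states)
end

ψ = ReachAvoidSpecification([8, 8], [4, 4])  # hypothetical goal and obstacle cells
isfailure(ψ, τ)  # true if τ misses the goal or hits the obstacle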

1.4.3 Algorithm Outputs


Validation algorithms provide a variety of outputs that help us understand the
behavior of a system. These outputs can be used to make decisions about the
system’s design, requirements, and deployment. Different validation algorithms
are designed to output different metrics. The algorithms presented in this book
support the following categories of analysis:


(Figure 1.7. Failure analysis outputs for a simple system where failures occur to the left of the dashed red line, with likelihood represented by the height of the black curve. Panels: falsification, failure distribution, failure probability. The plot on the left shows a set of failure samples that could be identified through falsification. The plot in the middle highlights the shape of the failure distribution, and the shaded region in the plot on the right corresponds to the probability of failure.)

• Failure analysis: Common types of failure analysis include falsification, failure distribution estimation, and failure probability estimation (figure 1.7). Falsification involves searching for possible scenarios that result in a failure. Some falsification algorithms also use a probabilistic model of the system to search for the most likely failure scenario. Other algorithms use this model to draw samples from the full distribution over failures or to estimate the probability of failure. We can use the results of failure analysis to inform future design decisions. Depending on the type and severity of the failure modes, system designers may enhance the system's sensors, change the agent's policy, revise the system's requirements, adapt the training of human operators, or bring in other mitigations. Designers may also simply recognize the failure modes as limitations and move on or use them as grounds to abandon the project altogether. Furthermore, an estimate of the probability of failure can be used to make decisions about the system's deployment. For example, the FAA places requirements on the probability of failure for aircraft systems before they can be deployed in the airspace.[33]

[33] These requirements are based on the type and severity of the failure. More information can be found in T. L. Arel, Safety Management System Manual, Air Traffic Organization, Federal Aviation Administration, 2022.

• Formal guarantees: Some algorithms output formal guarantees, or proofs, that a system satisfies a specification. One common type of formal guarantee is a
reachability guarantee, in which we determine the set of states that a system
could reach over time. The result can be used to prove that a system will never
enter a dangerous state. For example, we could prove that an aircraft collision
avoidance system will never reach a collision state. Formal guarantees are
always based on a set of assumptions such as the set of possible initial states.
If the assumptions are violated, the guarantees may no longer hold.


• Explanations: The ability to explain the behavior of a system helps us build confidence that it is operating as intended. Explanations can take many forms.
We may want to explain why an agent made a decision at a particular instance
in time or identify the root cause of a failure trajectory we found through falsification. We can use explanations during design to debug the system, identify
potential failure modes, and suggest possible improvements. Explanations can
also be used to build trust with stakeholders and regulatory bodies.

• Runtime assurances: The validation metrics we compute before deploying a system are typically based on a set of assumptions about its operating environment. If the operating environment changes during deployment, these metrics
may no longer be valid. Runtime monitoring algorithms check whether these
assumptions are being violated during operation and provide assurances that
the system is operating as intended. We can use runtime monitoring to detect
when the system deviates from its intended behavior and provide alerts to
operators.

In most real-world settings, we cannot guarantee that a system will behave as intended using a single validation algorithm or metric. Instead, we use a combination of these techniques to build a safety case. This idea is inspired by the Swiss cheese model of accident causation (figure 1.8).[34] This model views validation algorithms as slices of Swiss cheese[35] with holes, or limitations, that may cause us to miss potential failure modes. If we stack enough slices of Swiss cheese together, the holes in one slice will be covered by the cheese in another slice. By using a combination of validation algorithms, we increase our chances of catching potential failure modes before they could occur during operation.

[34] J. Reason, “Human Error: Models and Management,” British Medical Journal, vol. 320, no. 7237, pp. 768–770, 2000.
[35] Swiss cheese is a type of cheese that is known for having holes in its slices.

1.5 Challenges

Validating that a decision-making agent will behave as intended when deployed in the real world is a challenging problem. Several factors contribute to this difficulty:

• Complexity of the agent: It can be difficult to predict how a decision-making agent will behave in all possible scenarios. For example, the autonomy stack of
a self-driving car contains multiple components that interact with one another
in complex ways. This complexity makes it challenging to understand how the
system will react to different inputs such as sensor data, maps, and traffic laws.


(Figure 1.8. The Swiss cheese model for safety validation. Each layer (failure analysis, formal guarantees, explanations, and runtime assurances) represents a different validation algorithm. The holes in each layer represent the limitations of the validation algorithm. By stacking the layers together, we prevent potential failures from getting through to deployment.)

Furthermore, it is especially difficult to predict the behavior of decision-making agents that use machine learning models such as neural networks. These
models are often difficult to interpret and can exhibit unexpected behaviors.

• Complexity of the environment: As the capabilities of autonomous agents increase, they are deployed in increasingly complex environments. For example, self-driving cars must navigate through environments with pedestrians, traffic
signs, construction, and other vehicles. To validate these agents, we must be
able to properly model this complexity. Another challenge arises when agents
use complex sensors to perceive their environment. For example, for systems
that use camera sensors, we need to understand the set of images the camera
could produce from the environment.

• Cost and safety: Testing systems in the real world is expensive and can lead
to safety issues. For example, testing an aircraft collision avoidance system
involves operating aircraft in close proximity with one another for long periods
of time. For this reason, we often rely on simulation to test systems before
deploying them in the real world. We must be careful to ensure that the simulated system accurately models the real-world system. However, capturing the
full complexity of the real world in simulation can result in simulators that are
computationally expensive to run.

• Edge cases: Systems designed for safety-critical applications tend to behave safely in the vast majority of scenarios. However, rare edge cases can lead to
catastrophic failures. Because these edge cases occur infrequently, they are
often difficult to identify.

1.6 Overview

This section outlines the remaining chapters of the book, which can be organized
into several categories:

• Problem formulation: Chapters 2 and 3 discuss techniques to formulate validation problems. Specifically, chapter 2 relates to the system, which is the first input
to validation algorithms. We discuss how to build computational models of
each system component using data and domain knowledge. The accuracy of
the validation process depends on the accuracy of these models. Therefore, we
also discuss techniques to validate the accuracy of these models. Chapter 3
addresses the specification, which is the second input to validation algorithms.
In this chapter, we discuss techniques to translate operating requirements for
systems to formal specifications on their behavior.

• Sampling-based methods: Chapters 4 to 7 discuss methods that use trajectory samples from a system to analyze its behavior. Since it is often impossible to
sample all possible behaviors of a system, these techniques typically focus on
failure analysis rather than formal guarantees. Chapters 4 and 5 discuss efficient techniques to search for possible failures of a system using optimization and planning algorithms, respectively. Chapter 6 outlines a set of techniques to draw samples from the full distribution over failures for a system, and chapter 7 discusses efficient techniques to estimate the probability of failure from
samples.

• Formal methods: Chapters 8 to 10 discuss formal methods that provide guarantees on the behavior of a system. These methods can be used to systematically search for failures of a system or to prove the absence of failures if there are none. Chapter 8 discusses reachability techniques that compute the set of states that a system could reach over time. We can use the results of this analysis to determine whether the system reaches any states that violate the specification. Chapter 9 extends these techniques to systems with nonlinear models. In chapter 10, we perform reachability analysis on discrete systems.


• Runtime monitoring and explainability: Chapters 11 and 12 discuss techniques to monitor a system's behavior and explain its decisions. In chapter 11, we discuss
a form of online validation called runtime monitoring, which checks whether a
system is operating as intended during deployment. Finally, chapter 12 outlines
a set of methods that can be used to explain the behavior of a system to its
operators and other stakeholders.

2 System Modeling

2.1 Coming Soon


3 Property Specification

In the previous chapter, we focused on creating an accurate model of the system. The final step in defining a validation problem is to formalize the operating requirements of the system as a specification, which is a precise mathematical expression that defines the objectives of a system. Specifications are often derived from metrics, which map the performance of a system to a real number.
We begin by discussing common metrics used to measure the performance of
stochastic systems. We also discuss how to create composite metrics that capture
trade-offs between different performance objectives. We then show how to write
specifications as logical formulas using propositional logic, first-order logic, and
temporal logic. Finally, we discuss a special case of a temporal specification called
a reachability specification and show how to convert temporal logic specifications
into reachability specifications.

3.1 Properties of Systems

We describe the behavior of a system using metrics and specifications. A metric is a function that maps system behavior to a real number. For example, a common
metric used to evaluate aircraft collision avoidance systems is the miss distance
between two aircraft. A specification is a function that maps system behavior
to a Boolean value. Therefore, specifications are always either true or false. For
example, a specification for the grid world system might be to reach the goal
without hitting an obstacle.
Sometimes specifications can be derived from metrics. For example, given a
metric that measures the probability of collision for an aircraft collision avoidance
system, we can create a specification that requires the probability of collision to
be less than a certain threshold. We can also derive metrics from specifications.

Using the grid world specification, we could define a metric that measures the
distance between the agent and the goal or obstacle.
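Both directions of this relationship are straightforward to express in code. The sketch below is illustrative only; the metric definitions, the field layout of trajectory elements, the 500 m threshold, and the goal cell are all assumptions rather than implementations used elsewhere in the book.

using LinearAlgebra: norm

# metric: maps a trajectory to a real number, assuming each step stores
# the separation between the two aircraft in its s field
miss_distance(τ) = minimum(step.s for step in τ)

# specification derived from the metric by thresholding (500 m is arbitrary)
safe_separation(τ) = miss_distance(τ) > 500.0

# metric derived from the grid world specification: distance from the
# agent's final state to a hypothetical goal cell
goal_distance(τ; goal=[8, 8]) = norm(τ[end].s - goal)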
We use metrics or specifications to evaluate individual trajectories, sets of trajectories, or probability distributions over trajectories. The miss distance between two aircraft can be used to measure the performance of an aircraft collision avoidance system in a single encounter scenario (figure 3.1), and the net return can be used to measure the performance of one outcome of a financial trading strategy over time. We can also create metrics or specifications that operate over a set of trajectories. For example, we can compute the average miss distance or net gain over a set of possible trajectories or specify a threshold on the number of trajectories that result in a collision. The remainder of this chapter discusses techniques to formally express metrics and specifications.

(Figure 3.1. Example of a metric for an aircraft collision avoidance system over an individual trajectory (top) and over a set of trajectories (bottom).)

3.2 Metrics for Stochastic Systems

For stochastic systems, we often compute metrics over the full distribution of trajectories. Given a function f(τ) that maps an individual trajectory τ to a real-valued metric, we are interested in summarizing the distribution over the output of f(τ) (figure 3.2). The remainder of this section outlines several metrics used to summarize distributions.

3.2.1 Expected Value

A common metric used to summarize a distribution is its expected value. The expected value represents the average output of a function given a distribution over its inputs. It is defined as

    E_{τ∼p(·)}[f(τ)] = ∫ f(τ) p(τ) dτ    (3.1)

where p(τ) is the probability distribution over trajectories. While it is not always possible to evaluate the expected value analytically, we can estimate it using a variety of techniques such as the ones discussed in chapter 7.

(Figure 3.2. Distribution over the miss distance metric for an aircraft collision avoidance system. We can summarize this distribution with another metric such as the expected value of the miss distance.)

The expected value of a binary metric represents a probability. For instance, consider a binary metric f(τ) that evaluates to 1 if the agent hits an obstacle and 0 otherwise. The expected value of this metric is the probability that the agent hits an obstacle. In general, the expected value of a binary metric derived from a
specification is the probability that a randomly sampled trajectory will satisfy the specification. We could also derive a high-level specification from this probability by requiring that the probability of satisfying the specification is greater than a certain threshold.[1]

[1] H. Hansson and B. Jonsson, “A Logic for Reasoning about Time and Reliability,” Formal Aspects of Computing, vol. 6, pp. 512–535, 1994.

3.2.2 Variance
Another common summary metric is the variance, which measures the spread of the distribution. The variance of a metric f(τ) is defined as

    Var_{τ∼p(·)}[f(τ)] = E_{τ∼p(·)}[(f(τ) − E_{τ∼p(·)}[f(τ)])²]        (3.2)

Intuitively, the variance measures how much the metric f(τ) deviates from its expected value. A low variance indicates that the metric tends to be consistent across different trajectories, while a high variance indicates that the metric varies significantly. It is important to consider both the expected value and variance of a metric when evaluating system performance (figure 3.3).
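Both summary statistics are straightforward to estimate from sampled trajectories. The following is a minimal sketch, assuming a hypothetical rollout function that samples one trajectory and a metric function f; these names are illustrative and not part of any particular package.

using Statistics

# Monte Carlo estimates of E[f(τ)] and Var[f(τ)] from m sampled
# trajectories; `rollout` and `f` are assumed to be provided elsewhere.
function summarize_metric(rollout, f, m)
    ys = [f(rollout()) for _ in 1:m]
    return (mean=mean(ys), variance=var(ys))
end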

3.2.3 Value at Risk

When we are concerned with safety, we may want to use more conservative metrics that focus on worst-case outcomes. One such metric is the value at risk (VaR). Suppose we have a metric f(τ) for individual trajectories in which higher values indicate worse outcomes. This type of metric is often referred to as a risk metric. The VaR is the highest risk value that f(τ) is guaranteed not to exceed with probability α, which corresponds to the α-quantile of the distribution. For a particular value of α, a higher VaR indicates a more risky system.


[Figure 3.4. Effect of α on VaR and CVaR for α ∈ {0.9, 0.7, 0.5, 0.3, 0.1}, annotated with the expected value, VaR, and CVaR. Higher values for α correspond to more conservative risk estimates.]

3.2.4 Conditional Value at Risk

Another common metric derived from VaR is the conditional value at risk (CVaR),² which is the expected value of the metric f(τ) given that it exceeds the VaR:

    CVaR_α[f(τ)] = E_{τ∼p(·)}[f(τ) | f(τ) ≥ VaR_α[f(τ)]]        (3.3)

In other words, CVaR is the expected value of the (1 − α)-fraction of worst-case outcomes. A higher CVaR indicates that the system is more likely to perform poorly in the worst-case scenarios. Example 3.1 shows the VaR and CVaR of a risk metric for an aircraft collision avoidance system. Higher values of α push the VaR closer to the worst-case outcome and correspond to more conservative risk estimates. As α approaches 1, the CVaR approaches the risk of the worst-case outcome. As α approaches 0, the CVaR approaches the expected value of the risk metric (figure 3.4).

² The conditional value at risk is also known as the mean excess loss, mean shortfall, and tail value at risk. R. T. Rockafellar and S. Uryasev, "Optimization of Conditional Value-at-Risk," Journal of Risk, vol. 2, pp. 21–42, 2000. It is also a kind of coherent risk measure, which means that it satisfies some additional mathematical properties. Another coherent risk measure, not discussed here, is the entropic value at risk. A. Ahmadi-Javid, "Entropic Value-At-Risk: A New Coherent Risk Measure," Journal of Optimization Theory and Applications, vol. 155, no. 3, pp. 1105–1123, 2011.
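Given samples of a risk metric, both quantities can be estimated empirically. The sketch below is a minimal example under the assumption that the α-quantile of the samples serves as the VaR estimate and that the CVaR is the mean of the samples at or above it; the function names are our own.

using Statistics

# Empirical VaR: the α-quantile of sampled risk values ys.
var_estimate(ys, α) = quantile(ys, α)

# Empirical CVaR: the mean of the samples at or above the VaR,
# i.e., the expected value of the worst (1 − α)-fraction of outcomes.
function cvar_estimate(ys, α)
    v = var_estimate(ys, α)
    return mean(y for y in ys if y ≥ v)
end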
2011.

3.3 Composite Metrics

In many real-world settings, we must select one of several system designs or strategies for final deployment, and metrics allow us to make an informed decision. For example, we might compare the performance of two aircraft collision avoidance systems by computing the probability of collision over a set of aircraft encounters for each system. In these cases, we are often concerned with multiple metrics. For example, an aircraft collision avoidance system should minimize collisions while issuing a small number of alerts to pilots, and a financial trading strategy may aim to maximize return while minimizing risk.

It is often the case that multiple metrics describing system performance are at odds with one another, and some system designs may perform well on one metric but poorly on another. For instance, an aircraft collision avoidance system that

[Figure 3.5. Tradeoff between the alert rate and collision rate for an aircraft collision avoidance system. Each point represents a different system design.]


Example 3.1. VaR and CVaR for the loss of separation metric for an aircraft collision avoidance system.

Suppose a desired separation for the aircraft in an aircraft collision avoidance environment is 2,000 m. We can define a risk metric f(τ) to summarize the loss of separation as 2,000 m minus the miss distance. A higher loss of separation indicates higher risk. The plots below show the VaR and CVaR for the loss of separation metric for three different distributions over outcomes.

[Three distributions over the loss of separation (m), annotated with the expected value, VaR, and CVaR.]

Although all three distributions have the same expected value, the VaR and CVaR decrease as we move from left to right. The distribution with the lowest VaR and CVaR is the least risky because it has better worst-case outcomes.


minimizes the number of collisions may also increase the number of alerts issued to pilots, while one that minimizes alerts may increase the number of collisions (figure 3.5). In such cases, we can combine multiple metrics into a single composite metric that captures the trade-offs between different objectives.

We can compare systems with multiple metrics using the concept of Pareto optimality. A system design is Pareto optimal³ if we cannot improve one metric without worsening another. Given a set of system designs, the Pareto frontier consists of the subset of designs that are Pareto optimal. The Pareto frontier illustrates the trade-offs between metrics. Figure 3.6 shows the Pareto frontier for the aircraft collision avoidance systems shown in figure 3.5. Composite metrics allow system designers to select a single point on the Pareto frontier.

[Figure 3.6. Pareto frontier for a set of aircraft collision avoidance system designs. The points that comprise the Pareto frontier are highlighted in blue.]

³ Pareto optimality is a topic that was originally explored in the field of economics. It is named after Italian economist Vilfredo Pareto (1848–1923).

3.3.1 Weighted Metrics

Weighted metrics combine multiple metrics using a vector of weights that reflect the relative importance of each metric. Suppose we have a set of metrics f₁(τ), f₂(τ), …, fₙ(τ) that we wish to combine into a single metric. The most basic weighted metric is the weighted sum, which is defined as

    f(τ) = ∑ᵢ₌₁ⁿ wᵢ fᵢ(τ) = w⊤f(τ)        (3.4)

where w = [w₁, …, wₙ] is a vector of weights and f(τ) = [f₁(τ), …, fₙ(τ)] is a vector of metrics. The weighted sum allows us to balance the trade-offs between different metrics by adjusting the weights, and each set of weights will correspond to a point or set of points on the Pareto frontier.
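The computation itself is simple. Below is a minimal sketch of the weighted sum and a brute-force Pareto frontier filter over a set of metric vectors, assuming lower values are better for every metric; the function names are our own rather than part of any package.

using LinearAlgebra

# Weighted sum composite metric (equation 3.4).
weighted_sum(w, f) = dot(w, f)

# Design a dominates design b if it is no worse on every metric
# and strictly better on at least one (lower values assumed better).
dominates(a, b) = all(a .≤ b) && any(a .< b)

# Subset of metric vectors that are Pareto optimal.
pareto_frontier(fs) = [f for f in fs if !any(g -> dominates(g, f), fs)]

For instance, pareto_frontier([[0.2, 0.8], [0.5, 0.5], [0.6, 0.6]]) would keep the first two vectors and drop [0.6, 0.6], which is dominated by [0.5, 0.5].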

3.3.2 Goal Distance Metrics

Another way to combine metrics is to compute the Lₚ norm⁴ between f(τ) and a goal point:

    f(τ) = ‖f(τ) − f_goal‖ₚ        (3.5)

where f_goal is typically selected to be the utopia point. The utopia point is the point in metric space that represents the best possible outcome for each metric. While the utopia point is often unattainable, it provides a reference point for comparing different system designs. Figure 3.7 shows an example of the goal metric for the aircraft collision avoidance problem.

[Figure 3.7. Composite metric for an aircraft collision avoidance system using the L2 norm between the point and the goal point (blue star). The goal point is the utopia point of no alerts and no collisions. The color of each point represents the value of the composite metric with the selected point highlighted in green.]

⁴ An overview of the Lₚ norm operator is provided in appendix B.


Example 3.2. Using the weighted sum composite metric to select an aircraft collision avoidance system design along the Pareto frontier.

Suppose we want to create a composite metric for an aircraft collision avoidance system that balances the alert rate and collision rate. Using the weighted sum method, we define the composite metric as the weighted sum of the alert rate and collision rate. Selecting a weight vector then allows us to choose a point on the Pareto frontier. The plots below show the Pareto frontier for two different weight vectors. The first weight vector (w₁ = [0.8, 0.2]) gives more weight to minimizing the alert rate, while the second weight vector (w₂ = [0.2, 0.8]) gives more weight to minimizing the collision rate. The points are colored according to the value of the composite metric.

[Two plots of collision rate versus alert rate, colored by the composite metric for w₁ and w₂.]

The weight vector will be perpendicular to the Pareto frontier at the best design point. The weight vector w₁ is shown in blue for the first design point and w₂ is shown in blue for the second design point. The best design points are highlighted in green.


The weighted exponential sum is a composite metric that combines the weighted sum and goal metrics as follows:

    f(τ) = ∑ᵢ₌₁ⁿ wᵢ (fᵢ(τ) − f_goal,ᵢ)ᵖ        (3.6)

where p ≥ 1 is an exponent similar to that used in Lₚ norms. The weights wᵢ must be positive and sum to 1. The weighted exponential sum allows us to balance the trade-offs between different metrics while also considering the distance to the utopia point. Other more sophisticated weighting methods such as the weighted min-max metric and the exponential weighted metric build on these ideas.⁵

⁵ For more information on composite metrics, see T. W. Athan and P. Y. Papalambros, "A Note on Weighted Criteria Methods for Compromise Solutions in Multi-Objective Optimization," Engineering Optimization, vol. 27, no. 2, pp. 155–176, 1996.
3.3.3 Preference Elicitation

Creating a composite metric using weights requires us to specify the relative importance of each metric. However, even domain experts may have difficulty translating their preferences to a set of precise numerical weights. Preference elicitation allows us to infer a set of weights based on expert responses to a set of preference queries. For example, we might present a domain expert with a pairwise query containing the metrics of two possible system designs and ask them to select the preferred design. By repeating this process for multiple different pairwise queries of system designs, we can infer the weight that the expert assigns to each metric.

In this section, we focus on inferring the weights of a weighted sum composite metric using pairwise queries. There are other schemes for eliciting preferences, such as ranking multiple system designs, but pairwise queries have been shown to pose minimal cognitive burden on the expert.⁶ We will also restrict ourselves to weight vectors with positive entries that sum to a value less than or equal to 1. Figure 3.8 shows the space of possible weights for the aircraft collision avoidance example.

[Figure 3.8. The space of possible weights for the aircraft collision avoidance weighted sum metric.]

⁶ V. Conitzer, "Eliciting Single-Peaked Preferences Using Comparison Queries," Journal of Artificial Intelligence Research, vol. 35, pp. 161–191, 2009.

Suppose we query the expert with a pair of metric vectors f₁ and f₂ and find that the expert prefers f₁ to f₂. For the weighted sum metric to be consistent with
the preference, we must select a weight vector w such that

    w⊤f₁ < w⊤f₂        (3.7)
    w⊤(f₁ − f₂) < 0        (3.8)


where we assume that lower values for the composite metric are preferable.⁷ In effect, the response to the query further constrains the space of possible weight vectors (example 3.3).

⁷ If higher values are preferable, the inequality in equation (3.8) should be reversed.

Example 3.3. The effect of a preference query on the space of possible weight vectors for the aircraft collision avoidance example.

Suppose we want to infer the weights for a composite metric that combines the alert rate and collision rate for an aircraft collision avoidance system. When we query a domain expert or stakeholder with system designs f₁ = [0.8, 0.4] and f₂ = [0.4, 0.8], we find that the expert prefers f₁ to f₂. In other words, the expert prefers the system design with the higher alert rate and lower collision rate. Since the weight vector must be consistent with this preference (equation (3.8)), we can further constrain the space of possible weight vectors as shown in the figure below.

[Three plots of w_collision versus w_alert showing the designs f₁ and f₂, the half-spaces w⊤f₁ < w⊤f₂ and w⊤f₁ > w⊤f₂ separated by the line w⊤f₁ = w⊤f₂, and the constrained weight space.]

The purple shaded region in the center plot shows the space of possible weight vectors consistent with the expert's preference. The plot on the right shows the space of possible weight vectors consistent with the expert's preference and the constraint that the weights must sum to 1. We can further refine the space of possible weight vectors by querying the expert with additional pairs of system designs.

By querying the expert with multiple pairs of system designs, we can iteratively refine the space of possible weight vectors (figure 3.9). To minimize the number of times we must query the expert, it is common to select pairs of system designs that maximally reduce the space of possible weights. For example, one method is to select the query that comes closest to bisecting the space of possible weights.⁸ After querying the expert a desired number of times, we can select a

⁸ This method is known as Q-Eval. V. S. Iyengar, J. Lee, and M. Campbell, "Q-EVAL: Evaluating Multiple Attribute Items Using Queries," in ACM Conference on Electronic Commerce, 2001.


[Figure 3.9. The effect of multiple preference queries on the space of possible weight vectors for the aircraft collision avoidance example. The blue shaded regions show the space of possible weight vectors before obtaining the expert's preference, and the purple shaded regions show the weight vectors consistent with the expert's preference. The space of possible weight vectors before the next query is the intersection of these regions.]

set of weights from the refined weight space to create a composite metric that
reflects the expert’s preferences. While we could select any value for w that is
consistent with the expert’s responses, it is common to select the weight vector
that maximally separates the system designs that were presented to the expert.
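To make the constraint in equation (3.8) concrete, the sketch below checks whether a candidate weight vector is consistent with a set of recorded pairwise responses and filters a sample of candidate weights accordingly. It is a minimal illustration with made-up names, not a full elicitation procedure such as Q-Eval.

using LinearAlgebra

# A pairwise response records that the expert preferred f1 to f2.
struct Preference
    f1::Vector{Float64}
    f2::Vector{Float64}
end

# Equation (3.8): w is consistent with a response if w⊤(f1 − f2) < 0.
consistent(w, pref::Preference) = dot(w, pref.f1 - pref.f2) < 0

# Keep the candidate weight vectors consistent with every response.
feasible_weights(ws, prefs) =
    [w for w in ws if all(p -> consistent(w, p), prefs)]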

3.4 Logical Specifications

A logical specification ψ formally defines an operating requirement for a system using a logical formula. A logical formula is a precise expression that evaluates to either true or false. Logical specifications can be used to describe requirements for both individual trajectories and trajectory distributions. For example, a logical specification on an individual trajectory for an aircraft collision avoidance system might check whether the aircraft collide at any point in the trajectory. A logical specification over the entire distribution of aircraft collision avoidance trajectories might require that the probability of collision is less than a certain threshold. We can express logical formulas using several different types of logic. This section introduces two common types of logic.

3.4.1 Propositional Logic


Propositional logic constructs logical formulas by connecting propositions using logical operators.⁹ A proposition is a statement that is either true or false. The basic building block of propositional logic is an atomic proposition, which is a proposition that cannot be further decomposed. The two most basic logical expressions are

⁹ A detailed overview of propositional logic is provided by M. Huth and M. Ryan, Logic in Computer Science: Modelling and Reasoning about Systems. Cambridge University Press, 2004.


Example 3.4. Constructing a propositional logic formula from a statement.

Suppose we wish to express the following statement using propositional logic: "If the agent is in a safe state, then the agent is not in a collision state." Let the variable S represent whether the agent is in a safe state and C represent whether the agent is in a collision state. The propositional logic statement is S → ¬C (read as "S implies not C"). In this statement, S and C are atomic propositions because they cannot be broken down further. The logical formula S → ¬C is itself a proposition that can be combined with other propositions to create more complex formulas.

Expression   Explanation                                                          Construction
¬P           Negation (Not): Inverts a Boolean value.                             —
P ∧ Q        Conjunction (And): Evaluates to true if both P and Q are true.       —
P ∨ Q        Disjunction (Or): Evaluates to true if either P or Q is true.        ¬(¬P ∧ ¬Q)
P → Q        Implication: Evaluates to true unless P is true and Q is false.      ¬P ∨ Q
P ↔ Q        Biconditional: Evaluates to true when both P and Q are equivalent.   (P ∧ Q) ∨ (¬P ∧ ¬Q)

Table 3.1. Propositional logic operators and their equivalent construction using negation and conjunction. The constructions build from previous expressions for convenience (e.g., the use of ∨ in implication).

negation ("not") and conjunction ("and"). All other logical expressions such as disjunction ("or"), implication ("if-then"), and biconditional ("if and only if") can be constructed using negation and conjunction. Example 3.4 demonstrates the construction of a propositional logic formula from a statement.

Table 3.1 shows the propositional logic operators and their construction using negation and conjunction. We can describe propositional logic formulas using truth tables, which show the value of the formula as a function of its inputs. Figure 3.10 shows truth tables for each of the basic propositional logic operators. Logical operators can also be illustrated as logic gates (figure 3.11), which are fundamental building blocks for digital circuits.¹⁰ Example 3.5 implements the logical operators as functions in Julia.

¹⁰ R. Page and R. Gamboa, Essential Logic for Computer Science. MIT Press, 2019.
Press, 2019.

3.4.2 First-Order Logic

First-order logic extends propositional logic by introducing the notion of predicates and quantifiers.¹¹ It uses variables to represent objects in a domain and predicate functions to evaluate propositions over these objects. For example, we could create a variable x to represent the state of an agent and a predicate function P(x) that returns true if the agent is in a safe state and false otherwise. We combine

¹¹ First-order logic is also known as predicate logic. M. Huth and M. Ryan, Logic in Computer Science: Modelling and Reasoning about Systems. Cambridge University Press, 2004.
2004.


[Figure 3.10. Truth tables for the propositional logic operators using atomic propositions P and Q. The truth tables show the outputs of each logical operator for all possible combinations of Boolean values for P and Q.]

P       ¬P            P       Q       P ∧ Q         P       Q       P ∨ Q
false   true          false   false   false         false   false   false
true    false         false   true    false         false   true    true
                      true    false   false         true    false   true
                      true    true    true          true    true    true

P       Q       P → Q         P       Q       P ↔ Q
false   false   true          false   false   true
false   true    true          false   true    false
true    false   false         true    false   false
true    true    true          true    true    true

[Figure 3.11. Logical operators represented using logic gates: the AND, OR, and NOT gates, along with IMPLICATION and BICONDITIONAL gates constructed from them.]


Example 3.5. Julia implementations of propositional logic operators.

Consider two atomic propositions, P and Q. The basic operations of negation (!), conjunction (&&), and disjunction (||) are already implemented in most programming languages including Julia. Implication P → Q can be defined as the operator ⟶ given the Boolean values of P and Q:

julia> ⟶(P,Q) = !P || Q # \longrightarrow<TAB>
⟶ (generic function with 1 method)
julia> P = true;
julia> Q = false;
julia> P ⟶ Q
false

For the biconditional P ↔ Q, we can use the == sign:

julia> P = false;
julia> Q = false;
julia> P == Q
true

predicates to create propositions using logical operators. For instance, if we have a predicate function Q(x) that returns true if the agent is in a collision state, we can create the proposition P(x) → ¬Q(x) to express that the agent is not in a collision state when it is in a safe state.

Quantifiers allow us to evaluate propositions over a collection of variables. The universal quantifier ∀ ("for all") returns true if all variables in the domain satisfy the proposition. The existential quantifier ∃ ("there exists") returns true if at least one variable in the domain satisfies the proposition. These quantifiers allow us to create specifications over full system trajectories by setting the domain to be the set of all states in the trajectory. Example 3.6 demonstrates the use of quantifiers to define an obstacle avoidance specification over a trajectory, and the sketch below shows how such quantified formulas can be evaluated in code.
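Over the finite domain of states in a trajectory, the universal and existential quantifiers reduce to all and any. The following is a minimal sketch with hypothetical predicates for the grid world problem; the specific obstacle and goal states are made up for illustration.

# Hypothetical predicates over grid world states.
O(x) = x == [5, 5]   # obstacle
G(x) = x == [7, 8]   # goal

# ψ₂ = (∀x ¬O(x)) ∧ (∃x G(x)) over the finite domain of states
# in a trajectory, using `all` for ∀ and `any` for ∃.
ψ₂(states) = all(!O(x) for x in states) && any(G(x) for x in states)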

3.5 Temporal Logic

Temporal logic extends first-order logic to specify properties over time. It is partic-
ularly useful for specifying properties of dynamical systems because it allows us
to describe how trajectories should evolve. This section outlines three common
types of temporal logic.


Example 3.6. Universal and existential quantifiers for an obstacle avoidance problem. The red region indicates an obstacle while the green region indicates the goal.

Let x be a variable that represents the state of the agent in the grid world problem where we must avoid an obstacle (red), and define the domain X as the set of states that comprise a particular trajectory. We define a predicate function O(x) that evaluates to true if x is an obstacle state and false otherwise. To define a specification ψ₁ that states "for all states in the trajectory, the agent does not hit an obstacle," we can use the formula:

    ψ₁ = ∀x ¬O(x)

The examples below show evaluations of ψ₁ for two different trajectories.

[Two grid world trajectories: one with ψ₁ = true and one with ψ₁ = false.]

Suppose we also want the agent to reach a goal state while avoiding the obstacle. We can create an additional predicate G(x) that evaluates to true if x is a goal state and false otherwise. We then create ψ₂ to represent the statement "for all states in the trajectory, the agent does not hit an obstacle and there exists a state in the trajectory in which the agent reaches the goal" using the following formula:

    ψ₂ = (∀x ¬O(x)) ∧ (∃x G(x))

The examples below show evaluations of ψ₂ for two different trajectories.

[Two grid world trajectories: one with ψ₂ = true and one with ψ₂ = false.]


[Figure 3.12. Examples of the binary temporal operator until (P 𝒰 Q) and the unary temporal operators eventually (◊P) and always (□P) over a sequence of time steps. Each temporal operator is defined from time t to the end of the sequence. Note that always holds only as long as P is true at every subsequent time step.]

3.5.1 Linear Temporal Logic

Linear temporal logic (LTL) is a type of temporal logic that assumes a linear sequence of states.¹² It introduces three main temporal operators.¹³ Given a proposition P, the always (□P) operator specifies that P must be true at all time steps in the future. The eventually (◊P) operator requires that P be true at some point in the future. Given another proposition Q, the until (P 𝒰 Q) operator specifies that P must be true at least until Q becomes true.

Table 3.2 outlines the three LTL operators and their construction, and algorithm 3.1 evaluates LTL specifications over the sequence of states in a trajectory. The until operator can be written using first-order logic quantifiers, and the other two operators build on the until operator. Figure 3.12 shows the values of these operators over a trajectory, and example 3.7 shows how to construct an LTL specification for the grid world problem.

¹² A. Pnueli, "The Temporal Logic of Programs," in Symposium on Foundations of Computer Science (SFCS), 1977. Computation tree logic (CTL) is another common temporal logic that operates over multiple future paths. A detailed overview is provided in C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008, ch. 6.
¹³ Other common operators include next and weak until.

Expression   Explanation                                              Construction
P 𝒰 Q        Until: P is true at least until Q becomes true.          ∃t (Qₜ ∧ ∀t′ (((0 ≤ t′ < t) ∧ ¬Qₜ′) → Pₜ′))
◊P           Eventually: P will be true at some time in the future.   ⊤ 𝒰 P
□P           Always: P is true at every time in the future.           ¬◊(¬P)

Table 3.2. LTL operators. The propositions Pₜ and Qₜ represent whether P and Q are true at time t, and the ⊤ symbol indicates static truth.

struct LTLSpecification <: Specification
    formula # formula specified using SignalTemporalLogic.jl
end
evaluate(ψ::LTLSpecification, τ) = ψ.formula([step.s for step in τ])

Algorithm 3.1. Definition of an LTL specification. The formula is evaluated over the sequence of states in the trajectory starting at the first time step.
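As a quick usage sketch of algorithm 3.1, the following builds an always-avoid formula and evaluates it on a short trajectory of named tuples, assuming the Specification type and algorithm 3.1 are loaded; the obstacle state [5, 5] is made up for this example.

using SignalTemporalLogic

# Hypothetical obstacle state and an always-avoid specification.
F = @formula sₜ -> sₜ == [5, 5]
ψ = LTLSpecification(@formula □(¬F))

# A trajectory represented as a vector of named tuples with state field s.
τ = [(s=[4, 4],), (s=[4, 5],), (s=[4, 6],)]
evaluate(ψ, τ)  # returns true, since no state is the obstacle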


Example 3.7. LTL formula for an obstacle avoidance problem where a blue checkpoint must be reached before the green goal while avoiding the red obstacle.

For a navigation problem, let ψ be the LTL property specification that states "eventually reach the goal after passing through the checkpoint and always avoid the obstacle." First, we define the following predicate functions:

    F(sₜ): the state s at time t contains an obstacle
    G(sₜ): the state s at time t is the goal
    C(sₜ): the state s at time t is the checkpoint

The specification can be defined using LTL as follows:

    ψ = ◊G(sₜ) ∧ (¬G(sₜ) 𝒰 C(sₜ)) ∧ □¬F(sₜ)

This formula requires that the agent reaches the goal (◊G(sₜ)) but that the goal is not reached until the checkpoint (¬G(sₜ) 𝒰 C(sₜ)). Additionally, the agent must always avoid obstacles (□¬F(sₜ)). The figure in the caption shows an example trajectory that satisfies this specification. The following code constructs the LTL specification:

F = @formula sₜ -> sₜ == [5, 5]
G = @formula sₜ -> sₜ == [7, 8]
C = @formula sₜ -> sₜ == [8, 3]
ψ = LTLSpecification(@formula ◊(G) ∧ 𝒰(¬G, C) ∧ □(¬F))


3.5.2 Signal Temporal Logic

Signal temporal logic (STL) extends LTL to specify properties over signals.¹⁴ A signal is a real-valued sequence of points in discrete time that represents the state of a system over time.¹⁵ STL introduces two key extensions to LTL to handle real-valued signals. The first extension is the ability to specify properties over a time interval [a, b]. For example, we can write ◊[a,b] P to specify that P will eventually be true within the time interval [a, b].¹⁶

The second extension is the introduction of predicates that map real-valued signals to truth values. Specifically, it introduces the predicate μc(sₜ) that returns true if

    μ(sₜ) > c        (3.9)

where μ(·) is a real-valued function that operates on the state (example 3.8). Table 3.3 defines the specifications for the continuum world, inverted pendulum, and collision avoidance example problems using STL. Algorithm 3.2 provides a framework for evaluating STL specifications over a trajectory given a time interval.

¹⁴ STL was first introduced in O. Maler and D. Nickovic, "Monitoring Temporal Properties of Continuous Signals," in International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems, 2004.
¹⁵ These points may be sampled at regular or irregular intervals from a continuous-time function.
¹⁶ When a time range is omitted, we assume the positive time path of [0, ∞).

Example 3.8. Julia implementation of an STL formula.

Suppose we want to implement the following STL formula in code: "eventually the signal will be greater than 1." We can use the SignalTemporalLogic.jl package to define the predicate μ and the formula ψ as follows:

julia> using SignalTemporalLogic
julia> τ = [-1.0, -3.2, 2.0, 1.5, 3.0, 0.5, -0.5, -2.0, -4.0, -1.5];
julia> μ = @formula sₜ -> sₜ > 1.0;
julia> ψ = @formula ◊(μ);
julia> ψ(τ) # check if formula is satisfied
true

The formula is satisfied since the signal eventually becomes greater than 1.

struct STLSpecification <: Specification
    formula # formula specified using SignalTemporalLogic.jl
    I       # time interval (e.g. 3:10)
end
evaluate(ψ::STLSpecification, τ) = ψ.formula([step.s for step in τ[ψ.I]])

Algorithm 3.2. Definition of an STL specification for an interval.


System: Continuum World
Property: "Reach the goal without hitting the obstacle"
    G(sₜ): sₜ is in the goal region
    F(sₜ): sₜ is in the obstacle region
    ψ = ◊G(sₜ) ∧ □¬F(sₜ)
Implementation:
    G = @formula s->norm(s.-[6.5,7.5])≤0.5
    F = @formula s->norm(s.-[4.5,4.5])≤0.5
    ψ = @formula ◊(G) ∧ □(¬F)

System: Inverted Pendulum
Property: "Keep the pendulum balanced"
    B(sₜ): |θₜ| ≤ π/4
    ψ = □B(sₜ)
Implementation:
    B = @formula s->abs(s[1])≤π/4
    ψ = @formula □(B)

System: Aircraft Collision Avoidance
Property: "Ensure at least 50 meters relative altitude between 40 and 41 seconds"
    S(sₜ): |hₜ| ≥ 50
    ψ = □[40,41] S(sₜ)
Implementation:
    S = @formula s->abs(s[1])≥50
    ψ = @formula □(40:41, S)

Table 3.3. Signal temporal logic formulas for three of the example problems used throughout the book.


One benefit of expressing properties using STL is the ability to calculate a robustness metric using the specification. Robustness measures how "close" a signal is to satisfying a specification. For example, the robustness of the predicate μc(sₜ) is defined as

    ρ(sₜ, μc) = μ(sₜ) − c        (3.10)

If the predicate is false for sₜ, the robustness will be negative, and if it is true, the robustness will be positive. A violating signal comes closer to satisfying the specification as its robustness approaches zero, and a satisfying signal moves further from violating the specification as its robustness increases above zero.

Given propositions P and Q that correspond to predicates μc(sₜ) and μd(sₜ), we can also define robustness formulas for the propositional logic operators ¬P, P ∧ Q, P ∨ Q, and P → Q as follows:

    ρ(sₜ, ¬P) = −ρ(sₜ, P)        (3.11)
    ρ(sₜ, P ∧ Q) = min(ρ(sₜ, P), ρ(sₜ, Q))        (3.12)
    ρ(sₜ, P ∨ Q) = max(ρ(sₜ, P), ρ(sₜ, Q))        (3.13)
    ρ(sₜ, P → Q) = max(−ρ(sₜ, P), ρ(sₜ, Q))        (3.14)

Intuitively, the robustness of a conjunction is the minimum of the robustness of its components since both components must hold, and the robustness of a disjunction is the maximum of the robustness of its components since only one component must hold.

We can also define robustness over the temporal operators:

    ρ(sₜ, ◊[a,b] P) = max_{t′ ∈ [t+a, t+b]} ρ(sₜ′, P)        (3.15)
    ρ(sₜ, □[a,b] P) = min_{t′ ∈ [t+a, t+b]} ρ(sₜ′, P)        (3.16)
    ρ(sₜ, P 𝒰[a,b] Q) = max_{t′ ∈ [t+a, t+b]} min(ρ(sₜ′, Q), min_{t″ ∈ [t, t′]} ρ(sₜ″, P))        (3.17)

In general, the robustness of a temporal operator is the maximum or minimum of the robustness of its components over the specified time interval. We take the maximum over all time steps for the eventually operator to get the best-case signal because the signal must satisfy the property at only one time step in the interval. Conversely, we take the minimum over all time steps for the always


Example 3.9. Robustness of the formulas ψ₁ = ◊μ₀ and ψ₂ = □μ₀ over a signal τ.

Let μ₀(sₜ) be a predicate function that is true if sₜ is greater than 0. The following code computes the robustness of the formulas ◊μ₀ and □μ₀ over a signal τ:

julia> using SignalTemporalLogic
julia> τ = [-1.0, -3.2, 2.0, 1.5, 3.0, 0.5, -0.5, -2.0, -4.0, -1.5];
julia> μ = @formula sₜ -> sₜ > 0.0;
julia> ψ₁ = @formula ◊(μ);
julia> ρ₁ = ρ(τ, ψ₁)
3.0
julia> ψ₂ = @formula □(μ);
julia> ρ₂ = ρ(τ, ψ₂)
-4.0

The robustness of the formula ◊μ₀ is the maximum difference between the signal and the threshold. We would have to decrease all of our signal values by at least this value to make the formula false. The robustness of the formula □μ₀ is the minimum difference between the signal and the threshold. We would have to increase all of our signal values by at least the magnitude of this value to make the formula true. The figure in the caption shows the signal values that determine the robustness for each formula.


operator because the signal must satisfy the property at all time steps in the interval. Example 3.9 demonstrates this concept.

We can use the robustness metric to assess how close a given system trajectory is to a failure. Furthermore, if we are able to compute the gradient of the robustness metric with respect to certain inputs to the system, we can understand how these inputs affect the overall safety of the system. We will use this idea throughout the book to understand system behavior. For example, we can uncover the failure modes of a system by using the robustness metric to guide the simulator towards a failure trajectory (see chapter 4 for more details).
Taking the gradient of the robustness metric requires that the robustness formula is differentiable over the input space. However, the min and max functions that commonly occur in STL formulas are not differentiable everywhere. To address this challenge, we can use smooth approximations of the min and max functions, such as the softmin and softmax functions, respectively.¹⁷ These functions are defined as

    softmin(s; w) = ∑ᵢ₌₁ᵈ sᵢ exp(−sᵢ/w) / ∑ⱼ₌₁ᵈ exp(−sⱼ/w)        (3.18)
    softmax(s; w) = ∑ᵢ₌₁ᵈ sᵢ exp(sᵢ/w) / ∑ⱼ₌₁ᵈ exp(sⱼ/w)        (3.19)

where s is a signal of length d and w is a weight. As w approaches infinity, the softmin and softmax functions approach the mean function. As w approaches zero, the softmin and softmax functions approach the min and max functions (figure 3.13). We call the robustness metric that uses the softmin and softmax functions the smooth robustness metric. Figure 3.14 shows the gradient of the smooth robustness metric for different values of w.

¹⁷ K. Leung, N. Aréchiga, and M. Pavone, "Backpropagation Through Signal Temporal Logic Specifications: Infusing Logical Structure into Gradient-Based Methods," The International Journal of Robotics Research, vol. 42, no. 6, pp. 356–370, 2023.
[Figure 3.13. Smooth robustness metric ρ̃ for the formula in example 3.9 as a function of w. The robustness metric ρ₁ is shown as a blue dashed line, and the mean of the points in the trajectory is shown in gray. When w = 0, the smooth robustness metric ρ̃₁ is equal to the robustness metric ρ₁. When w is large, the smooth robustness metric approaches the mean of the trajectory.]

3.6 Reachability Specifications

A reachability specification is a special type of temporal logic specification that describes a state or set of states that a system should or should not reach during its execution. Let S_T ⊆ S represent the target set of states and define the predicate function R(sₜ) to be true if sₜ ∈ S_T and false otherwise. If our goal is to reach the target set, the reachability specification has the following form:

    ψ = ◊R(sₜ)        (3.20)


[Figure 3.14. The gradient of the robustness function for the formula in example 3.9 with respect to the signal values for w ∈ {0, 1, 2, 5, 20} in the smooth robustness metric ρ̃. When w = 0, the gradient is only nonzero at the point corresponding to the maximum robustness. As w increases, the gradient becomes nonzero at all points in the trajectory. Since the smooth robustness approaches the mean of the trajectory as w increases, the gradient becomes more uniform.]

If our goal is to avoid the target set, we write the reachability specification as

    ψ = ¬◊R(sₜ) = □¬R(sₜ)        (3.21)

Writing specifications in this form is useful because many algorithms related to formal methods and model checking are centered around reachability specifications. For example, the algorithms in chapters 8 to 10 determine whether a system could reach a target set. For some systems, such as the inverted pendulum system (example 3.10), the reachability specification is the most natural way to express the desired behavior. However, it is possible to convert other types of specifications into reachability specifications using various techniques. In fact, we can convert any LTL specification into a reachability specification by augmenting the state space of the system.

Example 3.10. Reachability specification for the inverted pendulum system.

Let S_T be the set of states for the inverted pendulum system where the pendulum has tipped over. In other words, S_T is the set of states where the angle θ is outside the range [−π/4, π/4]. Our goal is to avoid reaching this set of states, so we define the reachability specification as

    ψ = ¬◊R(sₜ)        (3.22)

where R(sₜ) is the predicate function that checks if the state is in the target set.

The first step in converting an LTL specification into a reachability specification is to represent the LTL formula as a Büchi automaton.¹⁸ A Büchi automaton consists of a set of states Q, an initial state q₁ ∈ Q, a set of atomic propositions P, a transition function δ, and a set of accepting states. The transition function δ maps

¹⁸ Büchi automata are named after Swiss mathematician Julius Richard Büchi (1924–1984).
transition function d, and a set of accepting states. The transition function d maps


a state and an instantiation of truth values for the atomic propositions to the next state. The accepting states are the states that must be visited infinitely often for the automaton to accept an infinite sequence of states. Example 3.11 shows a simple Büchi automaton with two states and two propositions.

Example 3.11. Example of a Büchi automaton with two states and two atomic propositions.

The figure below shows a simple Büchi automaton that accepts an infinite sequence of states if the sequence satisfies the LTL formula ◊(A ∧ B).

[Diagram: the initial state q₁ self-loops on ¬(A ∧ B) and transitions to q₂ on A ∧ B; the accepting state q₂ (drawn with a double circle) self-loops on ⊤.]

The automaton has two states Q = {q₁, q₂}, where q₁ is the initial state and q₂ is the accepting state. The automaton has two atomic propositions A and B. The transition function is defined for all possible combinations of truth values for the atomic propositions:

    δ(q₁, A ∧ B) = q₂
    δ(q₁, A ∧ ¬B) = q₁
    δ(q₁, ¬A ∧ B) = q₁
    δ(q₁, ¬A ∧ ¬B) = q₁
    δ(q₂, ·) = q₂

The diagram above compactly summarizes the transition function as δ(q₁, A ∧ B) = q₂ and δ(q₁, ¬(A ∧ B)) = q₁. The accepting state is denoted using the double circle.
It is possible to represent any LTL formula as a Büchi automaton (example 3.12).¹⁹ The accepting trajectories of the Büchi automaton satisfy the corresponding LTL formula. To obtain a reachability specification from the Büchi automaton, we must augment the state space of the system of interest. The new state space is the product of the states of the system and the states of the Büchi automaton:

    (s, q) ∈ S × Q        (3.23)

¹⁹ More details are provided in C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008. Open source software packages such as Spot can be used to do the conversion automatically. A. Duret-Lutz, "Manipulating LTL Formulas Using Spot 1.0," in Automated Technology for Verification and Analysis, 2013. The Spot.jl package provides an interface to the Spot library.


Example 3.12. Conversion of an LTL formula to a Büchi automaton.

Suppose we have an LTL formula that specifies that we need to visit a checkpoint before reaching a goal, written as

    ◊G ∧ (¬G 𝒰 C)

where G is an atomic proposition that represents whether we reach the goal and C is an atomic proposition that represents whether we reach the checkpoint. We can convert this formula into the Büchi automaton using Spot.jl as follows:

using Spot
a = translate(LTLTranslator(), ltl"◊(G) ∧ ¬G𝒰C")

The resulting automaton is shown below. It has 4 states and the same atomic propositions as the LTL formula. The accepting state is q₄, and the LTL formula is satisfied if the automaton visits q₄ infinitely often, or in other words, if a trajectory reaches q₄. The state q₂ represents the state where the agent has reached the goal but has not reached the checkpoint. Once this state has been reached, the agent will remain in this state forever with no chance of reaching the accepting state and satisfying the LTL formula. This state is often omitted in practice to reduce the size of the automaton.

[Diagram: the initial state q₁ transitions to q₂ on ¬C ∧ G, to q₃ on C ∧ ¬G, to q₄ on C ∧ G, and self-loops on ¬C ∧ ¬G; q₃ transitions to q₄ on G and self-loops on ¬G; q₂ and q₄ (accepting) self-loop on ⊤.]


The transition model for the new state space is defined by the transition model of the system T and the transition model of the Büchi automaton δ:

    T((s′, q′) | (s, q), a) = T(s′ | s, a)  if q′ = δ(q, L(s)), and 0 otherwise        (3.24)

where L(s) is a labeling function that maps a state s to values for the atomic propositions of the Büchi automaton. For example, a labeling function for the system in example 3.12 would map the state sₜ to the values that specify whether it is a goal state or checkpoint state.

We refer to the system with the augmented state space as the product system. The reachability specification for the product system is

    ψ = ◊R((sₜ, qₜ))        (3.25)

where R((sₜ, qₜ)) is a predicate function that returns true if qₜ is an accepting state of the Büchi automaton and false otherwise. Checking whether the product system satisfies the reachability specification is equivalent to checking whether the original system satisfies the LTL formula. Figure 3.15 shows the product system with the grid world as the original system and the LTL specification in example 3.12.
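To make the product construction concrete, the sketch below steps a product state forward under assumed helper functions: a hypothetical transition function for the original system, a transition function δ for the automaton, and a labeling function L. None of these names come from a particular package.

# One step of the product system (equation 3.24). The system state
# evolves under its own (possibly stochastic) transition model, while
# the automaton state advances deterministically from the labels of
# the current state.
function product_step(transition, δ, L, s, q, a)
    s′ = transition(s, a)  # sample s′ ~ T(· | s, a)
    q′ = δ(q, L(s))        # automaton reads the labels of s
    return (s′, q′)
end

# Reachability predicate on the automaton component (equation 3.25).
R(q, accepting) = q in accepting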

3.7 Summary

• Metrics and specifications allow us to quantify and express the desired behavior
of a system.

• For stochastic systems, we often compute metrics over the full distribution of
possible outcomes.

• In situations where we are interested in multiple metrics, we can create a


composite metric that accounts for the relative importance of each metric.

• Logical specifications allow us to formally express requirements for a system


using logical formulas.

• Propositional logic and first-order logic allow us to express properties over a


set of propositions.


[Figure 3.15. Converting an LTL specification (see example 3.12) for the grid world problem to a reachability specification by creating a product system with an augmented state space. The original system is shown on the left, the Büchi automaton is shown in the middle, and the product system is shown on the bottom. We start in the gray grid world until we either reach the checkpoint (blue) or goal (green). If we reach the goal in the gray grid world, we transition to the red grid world and remain there forever. If we reach the checkpoint in the gray grid world, we transition to the blue grid world and remain there until we reach the goal. If we reach the goal in the blue grid world, we transition to the green grid world, which represents an accepting state for the Büchi automaton. The dashed green line in the Büchi automaton does not appear in the product system because we cannot reach the checkpoint and goal at the same time in the grid world system. The set of target states for the reachability problem is the set of states in the green grid world.]

• Temporal logic extends first-order logic to express properties about how sys-
tems evolve over time.

• Linear temporal logic (LTL) and signal temporal logic (STL) are two common
temporal logics used in control and verification.

• Reachability specifications are a special type of temporal logic specification


that describe a state or set of states that a system should or should not reach
during its execution.

4 Falsification through Optimization

The first set of validation algorithms we will explore relates to falsification. Fal-
sification is the process of finding trajectories of a system that violate a given
specification. Such trajectories are sometimes referred to as counterexamples, failure
trajectories, or falsifying trajectories. We will refer to them in this textbook as failures
for simplicity. The beginning of the chapter introduces a naïve algorithm for find-
ing failures based on direct sampling, with the rest of the chapter focused on more
sophisticated algorithms that use optimization techniques to guide the search
for failures. Optimization-based falsification relies on the concept of disturbances,
which control the behavior of the system. We demonstrate how to frame the
falsification problem as an optimization over disturbance trajectories and outline
several techniques to perform the optimization.

4.1 Direct Sampling

When performing falsification, we want to find any trajectory τ that violates a given specification ψ, written as τ ∉ ψ. Algorithm 4.1 uses direct sampling to search for such trajectories.¹ It performs m rollouts and returns all failure trajectories. Figure 4.1 shows an example of direct falsification applied to the grid world problem.

Algorithm 4.1 may struggle for systems with rare failure events. For a system with probability of failure p_fail, we will require 1/p_fail samples on average to observe a single failure. In fact, we can infer a distribution over the number of samples required to find a failure. The probability of finding the first failure on the kth sample is equivalent to the probability of sampling k − 1 successes with probability 1 − p_fail and one failure with probability p_fail. We therefore write the

[Figure 4.1. Monte Carlo falsification applied to the grid world problem with m = 100 and d = 50. The probability of slipping is set to 0.8. The algorithm samples 96 trajectories before finding a failure. The failure trajectory is shown in red.]

¹ This type of sampling is often referred to as Monte Carlo sampling, named after the Monte Carlo casino in Monaco. Similar to gambling, the algorithm depends on random chance.

struct DirectFalsification
    d # depth
    m # number of samples
end

function falsify(alg::DirectFalsification, sys, ψ)
    d, m = alg.d, alg.m
    τs = [rollout(sys, d=d) for i in 1:m]
    return filter(τ->isfailure(ψ, τ), τs)
end

Algorithm 4.1. The direct falsification algorithm for finding failures. The algorithm performs rollouts to a depth d to generate m samples of the system sys. It then filters these samples and returns the ones that violate the specification ψ. If no failures are found, the algorithm returns an empty vector.

probability mass function of the distribution as

    P(k) = (1 − p_fail)^(k−1) p_fail        (4.1)

where k ∈ ℕ.

Equation (4.1) corresponds to the probability mass function of a geometric distribution with parameter p_fail. Figure 4.2 shows an example of a geometric distribution. The expected value of this distribution, 1/p_fail, corresponds to the average number of samples required to find a failure. Example 4.1 illustrates this relationship for the aircraft collision avoidance problem. Systems with very low failure probabilities will require a large number of samples for direct falsification. For example, some aviation systems have failure probabilities on the order of 10⁻⁹. These systems require 1 billion samples on average to observe a single failure event. The remainder of the chapter discusses more efficient falsification techniques.

[Figure 4.2. The probability mass function of a geometric distribution with parameter p_fail = 0.2. The expected value of this distribution is 1/p_fail = 5.]
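We can sanity-check this relationship numerically. The short sketch below estimates the average number of trials until the first failure for a toy setting in which each trajectory independently fails with probability p_fail; the simulation is illustrative and not tied to any particular system.

using Statistics

# Number of independent Bernoulli(p_fail) trials until the first
# failure, a stand-in for repeated rollouts.
function samples_until_failure(p_fail)
    k = 1
    while rand() ≥ p_fail
        k += 1
    end
    return k
end

# The empirical mean approaches 1/p_fail (about 100 here).
mean(samples_until_failure(0.01) for _ in 1:10_000)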

4.2 Disturbances

We can systematically search for failures by taking control of the sources of randomness in the system. We control these sources of randomness using disturbances. To incorporate disturbances into a system, we rewrite its sensor, agent, and environment models by breaking up their stochastic and deterministic elements. For example, the observation model o ∼ O(· | s) can be written as a deterministic function of the current state s and a stochastic disturbance x_o such that

    o = O(s, x_o),    x_o ∼ D_o(· | s)        (4.2)


Example 4.1. Direct falsification applied to the aircraft collision avoidance problem with different levels of noise applied to the transitions. There are four state variables for the collision avoidance problem. These plots show how two of these state variables evolve for each trajectory. The horizontal axis is the time to collision t_col, and the vertical axis is the altitude relative to the intruder aircraft h. Appendix E.1 provides additional details about this problem.

Suppose we want to find failures of an aircraft collision avoidance system using direct falsification. In this scenario, a failure is a collision between two aircraft, which occurs when the relative altitude to the intruder aircraft h is within ±50 m and the time to collision t_col is zero. The collision avoidance environment applies additive noise with standard deviation σ to the relative vertical rate of the intruder aircraft dh at each time step. This noise accounts for variation in pilot response to advisories and the intruder flight path. The plots below use different values of σ and show the trajectory samples produced before finding the first failure with the first failure trajectory highlighted in red.

[Three plots of h (m) versus t_col (s) for σ = 5 m, σ = 3 m, and σ = 2 m.]

As σ decreases, failures become less likely, and more trajectories are required to find a failure. In this example, the first failure is found after 41 samples with σ = 5 m, 84 samples with σ = 3 m, and 522 samples with σ = 2 m.


where O(s, x_o) is a deterministic function and D_o(· | s) is a disturbance distribution. For example, a disturbance applied to an additive noise sensor controls the amount of sensor noise added to the true state to produce an observation. Example 4.2 demonstrates this concept for a Gaussian noise sensor model.

Example 4.2. Separating the stochastic and deterministic elements of a sensor with a Gaussian noise model.

Suppose we model a sensor using a Gaussian noise model such that O(o | s) = N(o | s, Σ). We can rewrite this sensor model as

    o = s + x_o,    x_o ∼ N(· | 0, Σ)

We can then define this sensor using the following code:

struct GaussianNoiseSensor <: Sensor
    Do # distribution = Do(s)
end
(sensor::GaussianNoiseSensor)(s) = s + rand(sensor.Do(s))
(sensor::GaussianNoiseSensor)(s, xo) = s + xo

In this code, Do represents the nominal disturbance distribution D_o(x_o | s) = N(x_o | 0, Σ). Since it is a conditional distribution, we represent it as a function that takes in a state s and outputs a distribution. The disturbance in this sensor model does not depend on the state, so the function returns the same distribution regardless of its input. The first function represents the original sensor model and adds noise sampled from Do to the true state s to produce an observation. The second function allows us to deterministically produce an observation for state s given a disturbance xo.

The agent's policy and the environment's transition model can also be decomposed:

    a = π(o, x_a),    x_a ∼ D_a(· | s)        (4.3)
    s′ = T(s, a, x_s),    x_s ∼ D_s(· | s, a)        (4.4)

where π(o, x_a) and T(s, a, x_s) are deterministic functions and D_a(· | s) and D_s(· | s, a) are disturbance distributions. In this textbook, we will wrap these three components into a single disturbance x and disturbance distribution D. Given a current state and disturbance distribution, we can sample a disturbance and produce an observation, action, and next state (algorithm 4.2). For system components that are modeled using deterministic functions, applying a disturbance has no effect.


struct Disturbance
    xa # agent disturbance
    xs # environment disturbance
    xo # sensor disturbance
end

struct DisturbanceDistribution
    Da # agent disturbance distribution
    Ds # environment disturbance distribution
    Do # sensor disturbance distribution
end

function step(sys::System, s, D::DisturbanceDistribution)
    xo = rand(D.Do(s))
    o = sys.sensor(s, xo)
    xa = rand(D.Da(o))
    a = sys.agent(o, xa)
    xs = rand(D.Ds(s, a))
    s′ = sys.env(s, a, xs)
    x = Disturbance(xa, xs, xo)
    return (; o, a, s′, x)
end

Algorithm 4.2. Implementation of a disturbance and disturbance distribution. The individual disturbance components are used to control the agent, environment, and sensor respectively. Since the components of the disturbance distribution are conditional distributions, we assume they are functions that take in the evidence variables and output a sampleable distribution. Given a current state s and disturbance distribution D, the step function samples a disturbance and uses it to produce an observation, action, and next state.

4.3 Fuzzing

Unlike direct sampling, which samples from the nominal distribution over system trajectories, we can find failures more efficiently by sampling from a trajectory distribution designed to stress the system. We refer to this process as fuzzing.² Before we can perform fuzzing, we need to define the components of a trajectory distribution. There are two sources of randomness in a trajectory rollout: the initial state and the disturbances applied at each time step. Therefore, we can fully capture the distribution over trajectories by specifying an initial state distribution and a disturbance distribution for each time step (algorithm 4.3).

² Fuzzing is a well-known concept in testing of traditional software. It refers to the generation of off-nominal inputs to a program to uncover potential bugs or failures and was first introduced in B. P. Miller, L. Fredriksen, and B. So, "An Empirical Study of the Reliability of UNIX Utilities," Communications of the ACM, vol. 33, no. 12, pp. 32–44, 1990.

Algorithm 4.3. Definition of a trajectory distribution. The initial_state_distribution function returns the distribution over initial states. The disturbance_distribution function returns the disturbance distribution at time t. The depth function returns the number of time steps in the trajectories sampled from the distribution.

abstract type TrajectoryDistribution end
function initial_state_distribution(p::TrajectoryDistribution) end
function disturbance_distribution(p::TrajectoryDistribution, t) end
function depth(p::TrajectoryDistribution) end

In the algorithms presented so far, we have been implicitly sampling from the nominal trajectory distribution for a system. We can explicitly construct this distribution for a given system using algorithm 4.4. The nominal trajectory distribution uses the default initial state and disturbance distributions for the components of the system. Nominal trajectory distributions are stationary, meaning that the disturbance distribution does not depend on time.

Algorithm 4.4. The nominal trajectory distribution for a system. We can construct this distribution for a given system sys and depth d using the default initial state and disturbance distributions specified by the components of the system. Nominal trajectory distributions are stationary, so the disturbance_distribution function returns the same value for any time input t.

struct NominalTrajectoryDistribution <: TrajectoryDistribution
    Ps # initial state distribution
    D # disturbance distribution
    d # depth
end

function NominalTrajectoryDistribution(sys::System, d)
    D = DisturbanceDistribution((o) -> Da(sys.agent, o),
                                (s, a) -> Ds(sys.env, s, a),
                                (s) -> Do(sys.sensor, s))
    return NominalTrajectoryDistribution(Ps(sys.env), D, d)
end

initial_state_distribution(p::NominalTrajectoryDistribution) = p.Ps
disturbance_distribution(p::NominalTrajectoryDistribution, t) = p.D
depth(p::NominalTrajectoryDistribution) = p.d

We sample trajectories from a trajectory distribution by performing rollouts. Algorithm 4.5 implements a trajectory rollout given a trajectory distribution. It returns a trajectory τ = (s₁, o₁, a₁, x₁, …, s_d, o_d, a_d, x_d). If the initial state distribution and disturbance distributions correspond to the nominal distributions for the system, algorithm 4.5 performs the same function as algorithm 1.2. However, algorithm 4.5 also allows us to sample from a different trajectory distribution. We can use it to perform fuzzing by specifying a trajectory distribution that is designed to increase the likelihood of sampling failure trajectories. Example 4.3 demonstrates this technique on the inverted pendulum system.

Algorithm 4.5. A function that performs a rollout of a system sys to a depth d given a trajectory distribution p. It samples an initial state from the initial state distribution and then repeatedly calls the step function, which steps the system forward in time.

function rollout(sys::System, p::TrajectoryDistribution; d=depth(p))
    s = rand(initial_state_distribution(p))
    τ = []
    for t = 1:d
        o, a, s′, x = step(sys, s, disturbance_distribution(p, t))
        push!(τ, (; s, o, a, x))
        s = s′
    end
    return τ
end
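For example (a minimal sketch, assuming a system sys and specification ψ constructed as in earlier chapters; the depth of 41 and sample count of 100 are arbitrary choices), we can sample trajectories from the nominal distribution and count failures:

p = NominalTrajectoryDistribution(sys, 41)
τs = [rollout(sys, p) for i in 1:100]
n_fail = count(τ -> isfailure(ψ, τ), τs)

Swapping p for a fuzzing distribution changes only the first line; the rollout call itself is unchanged.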


Example 4.3. Fuzzing applied to the inverted pendulum system. By slightly increasing the standard deviation of the simulated sensor noise, we are able to uncover two failures.

Suppose we want to find failures of the inverted pendulum system with an additive noise sensor with D_o(x_o | s) = N(x_o | 0, Σ) and Σ = 0.01I. If we collect 100 samples with this nominal distribution, we do not find any failures. However, if we define a new distribution and increase the standard deviation of the sensor noise on each variable from 0.1 to 0.15 (referred to as fuzzing), we are able to find two failures of the system in the first 100 samples. The following code can be used to define the fuzzing distribution:

struct PendulumFuzzingDistribution <: TrajectoryDistribution
    Σₒ # sensor disturbance covariance
    d # depth
end
function initial_state_distribution(p::PendulumFuzzingDistribution)
    return Product([Uniform(-π / 16, π / 16), Uniform(-1., 1.)])
end
function disturbance_distribution(p::PendulumFuzzingDistribution, t)
    D = DisturbanceDistribution((o)->Deterministic(),
                                (s,a)->Deterministic(),
                                (s)->MvNormal(zeros(2), p.Σₒ))
    return D
end
depth(p::PendulumFuzzingDistribution) = p.d

The plots show the disturbances and trajectories for both distributions.

[Plots: the top row shows the sampled sensor noise disturbances (x_{o,θ}, x_{o,ω}) for the nominal and fuzzing distributions; the bottom row shows the corresponding trajectories of θ over time, with failures highlighted in red.]


4.4 Falsification through Optimization

The falsification problem can be reformulated as a search over the space of initial states and disturbances. Algorithm 4.6 performs a trajectory rollout given an initial state and a sequence of disturbances. We refer to this sequence of disturbances as a disturbance trajectory 𝐱 = (x₁, …, x_d). Unlike algorithm 4.5, algorithm 4.6 is deterministic: the initial state s and disturbance trajectory 𝐱 fully determine the resulting trajectory τ.

Algorithm 4.6. A function that performs a rollout of a system sys to a depth d given an initial state s and disturbance trajectory 𝐱. It repeatedly calls the step function, which steps the system forward in time. The step function takes in the current state s and disturbance x and deterministically produces an observation o from the sensor, gets the action a from the agent based on this observation, and determines the next state s′ from the environment.

function step(sys::System, s, x)
    o = sys.sensor(s, x.xo)
    a = sys.agent(o, x.xa)
    s′ = sys.env(s, a, x.xs)
    return (; o, a, s′)
end

function rollout(sys::System, s, 𝐱; d=length(𝐱))
    τ = []
    for t in 1:d
        x = 𝐱[t]
        o, a, s′ = step(sys, s, x)
        push!(τ, (; s, o, a, x))
        s = s′
    end
    return τ
end
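As a quick check of this determinism (a sketch; the zero initial state, zero disturbances, and depth of 21 are illustrative assumptions for the pendulum system), the same inputs always reproduce the same trajectory:

s = [0.0, 0.0]
𝐱 = [Disturbance(0, 0, zeros(2)) for t in 1:21]
τ1 = rollout(sys, s, 𝐱)
τ2 = rollout(sys, s, 𝐱)
@assert all(t1.s == t2.s for (t1, t2) in zip(τ1, τ2))

This property is what lets the optimizers in this chapter treat the rollout as an ordinary deterministic function of (s, 𝐱).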

To perform falsification in this context, we want to find an initial state s and disturbance trajectory 𝐱 that produce a trajectory τ such that τ ∉ ψ. Optimization-based falsification techniques use an objective function to guide this search. An objective function f(τ) maps a trajectory τ to a value related to its level of safety with respect to ψ. We can then search for failures by minimizing this objective over the space of initial states and disturbances as follows:

    minimize_{s, 𝐱}  f(τ)
    subject to  τ = Rollout(s, 𝐱)     (4.5)

The rest of this chapter discusses different objective functions and optimization techniques for solving the optimization problem in equation (4.5).


4.5 Objective Functions

Objective functions guide the search for failure trajectories. In general, a good
objective function should output lower values for trajectories that are closer to a
failure. The specific measure of closeness used is dependent on the application.
For example, in the aircraft collision avoidance problem, we may use the vertical
miss distance between the aircraft as the objective value.

4.5.1 Temporal Logic Robustness


If y is specified using a temporal logic formula, we can use its robustness measure
(see section 3.5.2) as an objective function such that f (t ) = r(t, y). Note that
t itself is a function of the initial state and disturbance trajectory, and we can
also write this objective function as f (s, x) = r(Rollout(s, x), y). Algorithm 4.7
implements this objective function given a system and a specification.

Algorithm 4.7. Temporal logic robustness objective. The function takes in a vector of real values x, a system sys, and a specification ψ. It returns the smoothed robustness of the resulting trajectory (or the robustness if smoothness is set to 0). The vector x contains information about the initial state and disturbances. The extract function extracts an initial state and disturbance trajectory from x and is system specific.

function robustness_objective(x, sys, ψ; smoothness=0.0)
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    𝐬 = [step.s for step in τ]
    return robustness(𝐬, ψ.formula, w=smoothness)
end

Since most optimization algorithms operate on a vector of real values, algorithm 4.7 takes in a vector of real values containing information about the initial state and disturbances. The first step inside the objective function is to extract the initial state and disturbance trajectory in a way that is system specific. Example 4.4 demonstrates this process for the inverted pendulum system. Given these extracted values, we can perform a rollout of the system using algorithm 4.6 and compute the corresponding robustness. For optimization algorithms that require gradients of the objective function, we use the smoothed robustness instead.

4.5.2 Most Likely Failure

The use of an objective function in optimization-based falsification algorithms allows us to move beyond a simple search for failures and incorporate other objectives into the search. For example, instead of finding any failure, we may want to find the most likely failure of a system.³

³ Another common objective is to find the most severe failure according to a severity metric. There may also be domain-specific objectives such as obeying traffic laws in a driving scenario.


Example 4.4. Extracting an initial state and disturbance trajectory from a vector of real values for the inverted pendulum system.

Suppose we want to compute the robustness objective for the inverted pendulum system where the initial state is always s = [0, 0]. We write the extract function as follows:

function extract(env::InvertedPendulum, x)
    s = [0.0, 0.0]
    𝐱 = [Disturbance(0, 0, x[i:i+1]) for i in 1:2:length(x)]
    return s, 𝐱
end

The function extracts the sensor disturbances from the real-valued vector x to create a disturbance trajectory 𝐱. It then returns the fixed initial state s and the disturbance trajectory.

Determining the most likely failure requires specifying the distribution over trajectories and using its probability density function to evaluate likelihoods. Assuming that the initial state and disturbances are sampled independently from one another, the probability density function of a trajectory distribution p is

    p(τ) = p(s₁) ∏_{i=1}^{d} D(x_i | s_i, a_i, o_i)     (4.6)

where D(x | s, a, o) = D_a(x_a | o) D_s(x_s | s, a) D_o(x_o | s). Algorithm 4.8 implements equation (4.6).

Algorithm 4.8. Probability density function of a trajectory distribution p. We perform computations in log space for numerical stability. We first compute the log likelihood of the initial state according to the initial state distribution. We then add the log likelihood of each disturbance in the trajectory. The first function evaluates the log likelihood of a disturbance given a disturbance distribution D and the evidence variables.

function Distributions.logpdf(D::DisturbanceDistribution, s, o, a, x)
    logp_xa = logpdf(D.Da(o), x.xa)
    logp_xs = logpdf(D.Ds(s, a), x.xs)
    logp_xo = logpdf(D.Do(s), x.xo)
    return logp_xa + logp_xs + logp_xo
end

function Distributions.pdf(p::TrajectoryDistribution, τ)
    logprob = logpdf(initial_state_distribution(p), τ[1].s)
    for (t, step) in enumerate(τ)
        s, o, a, x = step
        logprob += logpdf(disturbance_distribution(p, t), s, o, a, x)
    end
    return exp(logprob)
end
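As a usage sketch (assuming a system sys and a trajectory τ sampled with algorithm 4.5 earlier in the chapter), evaluating a trajectory's likelihood under the nominal distribution is then a single call:

p = NominalTrajectoryDistribution(sys, length(τ))
likelihood = pdf(p, τ) # initial state density times each step's disturbance density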


Algorithm 4.9. Objective function for finding the most likely failure. The function takes in a vector of real values x, a system sys, and a specification ψ. If the resulting trajectory is a failure, it returns the negative likelihood of the trajectory under the nominal trajectory distribution p. Otherwise, it returns the smoothed robustness of the trajectory (or the robustness if smoothness is set to 0).

function likelihood_objective(x, sys, ψ; smoothness=0.0)
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    if isfailure(ψ, τ)
        p = NominalTrajectoryDistribution(sys, length(𝐱))
        return -pdf(p, τ)
    else
        𝐬 = [step.s for step in τ]
        return robustness(𝐬, ψ.formula, w=smoothness)
    end
end

Given a trajectory distribution p, we define the most likely failure objective (algorithm 4.9) as follows:

    f(τ) = −p(τ)     if τ ∉ ψ
           ρ(τ, ψ)   otherwise     (4.7)

If the input trajectory does not produce a failure, equation (4.7) uses the robustness to guide the search toward any failure. If the input does produce a failure trajectory, it uses the negative likelihood of the trajectory to guide the search toward more likely failures. Figure 4.3 compares a search for failures with a search for the most likely failure on the grid world problem. While the robustness objective finds failures that move directly toward the obstacle, the most likely failure objective finds a failure that stays close to the nominal path.
The objective function in equation (4.7) leads to multiple practical challenges. For example, to encourage the optimization algorithm to find failures, we must ensure that failures never have a higher objective value than successes. Since ρ(τ, ψ) ≥ 0 and −p(τ) ≤ 0, equation (4.7) satisfies this condition. However, p(τ) can be very small for long trajectories, which can lead to numerical stability issues. Using the log likelihood improves numerical stability but breaks the condition that failures never have a higher objective value than successes.

This numerical instability, as well as the discontinuity at the point of a failure, creates challenges for first- and second-order optimization algorithms (section 4.6). Furthermore, while the global minimum of the objective function in equation (4.7) corresponds to the most likely failure of the system, many optimizers are only guaranteed to find local minima.


[Figure 4.3: A comparison of optimization-based falsification on the grid world using the robustness objective and the likelihood objective at iterations 1, 5, and 8 and at convergence. The plots show the progression of a population-based optimization algorithm, which will be discussed in the next section. The shaded gray path on the plots in the final column represents the most likely path for the system. The robustness objective finds failures that quickly move toward the obstacle, while the most likely failure objective finds a failure that stays close to the nominal path, only veering toward the obstacle at the end.]

Due to this fact and the numerical stability issues, other objective functions may lead to the discovery of more likely failures in practice.

Another common objective for most likely failure analysis is

    f(τ) = ρ(τ, ψ) − λ log(p(τ))     (4.8)

where λ is a weighting parameter selected by the user (algorithm 4.10). This objective is smooth and encourages the optimization algorithm to search simultaneously for trajectories that are both likely and close to failure.

Algorithm 4.10. Objective function that weights the tradeoff between robustness and likelihood. The function takes in a vector of real values x, a system sys, and a specification ψ. It returns a weighted combination of the smoothed robustness (or the robustness if smoothness is set to 0) and the negative log likelihood under the nominal trajectory distribution p.

function weighted_likelihood_objective(x, sys, ψ; smoothness=0.0, λ=1.0)
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    𝐬 = [step.s for step in τ]
    p = NominalTrajectoryDistribution(sys, length(𝐱))
    return robustness(𝐬, ψ.formula, w=smoothness) - λ * log(pdf(p, τ))
end
4.6 Optimization Algorithms

We can search for failures by applying a variety of optimization algorithms to the optimization problem in equation (4.5).⁴ Algorithm 4.11 implements optimization-based falsification given an objective and optimization algorithm. It computes the system-specific objective function f, runs the optimizer, and returns its output. Example 4.5 applies algorithm 4.11 to find failures in the inverted pendulum problem using an off-the-shelf optimization package.⁵ The choice of optimization algorithm depends on the complexity of the system under test and the level of access to the system's internal model. The rest of this section outlines several categories of optimization algorithms and compares their advantages and disadvantages in the context of falsification. See appendix C for an in-depth review of the algorithms.

⁴ M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.
⁵ Off-the-shelf optimization packages provide implementations of a variety of optimization algorithms. One such package in the Julia ecosystem is Optim.jl.

Algorithm 4.11. The optimization-based falsification algorithm for finding failures. The algorithm first computes a system-specific objective function f from a generic objective function objective (as specified in section 4.5). It then runs the optimizer and returns the results.

struct OptimizationBasedFalsification
    objective # objective function
    optimizer # optimization algorithm
end

function falsify(alg::OptimizationBasedFalsification, sys, ψ)
    f(x) = alg.objective(x, sys, ψ)
    return alg.optimizer(f, sys, ψ)
end

One category of optimization techniques is local descent methods. Local descent methods start from an initial design point and incrementally improve it until some convergence criterion is met. At each iteration, they use a local model of the objective function at the current design point to determine a direction of improvement. They then take a step in this direction to compute the next design point. Some methods use the gradient or Hessian of the objective function with respect to the current design point to create the local model. These methods are called first-order and second-order methods, respectively. Figure 4.4 shows the result of applying a first-order method called gradient descent to find failures for the inverted pendulum example.

[Figure 4.4: First-order method applied to falsify the inverted pendulum example. The plot shows successive iterations of the algorithm, with darker trajectories indicating later iterations. Failures are highlighted in red. The algorithm gets closer to a failure with each iteration until it eventually begins to find failures.]
While the gradient and Hessian provide a very powerful signal for optimization algorithms, they are not always available.⁶ Some simulators do not provide access to the internal model of the system, making exact computation of the gradient infeasible. We often refer to such simulators as black-box simulators. Another category of optimization algorithms called direct methods is better suited for systems with black-box simulators. They traverse the input space using only information from function evaluations, eliminating the need for access to the system's internal model.

⁶ Gradient information is a strong enough signal to effectively optimize machine learning models with billions of parameters.
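For instance (a sketch mirroring example 4.5; the helper name nelder_mead, the design-point length of 42, and the iteration budget are assumptions, not from the text), a direct method such as Nelder-Mead from Optim.jl can be swapped into algorithm 4.11 without any gradient information:

using Optim

function nelder_mead(f, sys, ψ)
    x₀ = zeros(42) # one entry per disturbance dimension per time step
    results = optimize(f, x₀, NelderMead(), Optim.Options(iterations=1000))
    τ = rollout(sys, extract(sys.env, Optim.minimizer(results))...)
    return isfailure(ψ, τ) ? [τ] : []
end

alg = OptimizationBasedFalsification(robustness_objective, nelder_mead)
failures = falsify(alg, inverted_pendulum, ψ)

Because Nelder-Mead only evaluates f, the smoothness of the robustness objective no longer matters, which makes this setup a natural fit for black-box simulators.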


Example 4.5. Applying a second-order method called L-BFGS to falsify the inverted pendulum example. We use the open-source implementation of L-BFGS in the Optim.jl package. For more information on the L-BFGS algorithm, see J. Nocedal, "Updating Quasi-Newton Matrices with Limited Storage," Mathematics of Computation, vol. 35, no. 151, pp. 773–782, 1980.

The Optim.jl package provides implementations of several optimization algorithms. In this example, we show how to use the Optim.jl implementation of a second-order method called L-BFGS to falsify the inverted pendulum system. We define the optimizer function for algorithm 4.11 and run the algorithm using the robustness objective as follows:

using Optim
function lbfgs(f, sys, ψ)
    x₀ = zeros(42)
    alg = Optim.LBFGS()
    options = Optim.Options(store_trace=true, extended_trace=true)
    results = optimize(f, x₀, alg, options; autodiff=:forward)
    τs = [rollout(sys, extract(sys.env, iter.metadata["x"])...)
          for iter in results.trace]
    return filter(τ->isfailure(ψ, τ), τs)
end
objective(x, sys, ψ) = robustness_objective(x, sys, ψ, smoothness=1.0)
alg = OptimizationBasedFalsification(objective, lbfgs)
failures = falsify(alg, inverted_pendulum, ψ)

In this implementation, we are optimizing over a disturbance trajectory with depth d = 21. Since each sensor disturbance is two-dimensional, the length of each design point is 42. The lbfgs function starts with an initial design point of all zeros, specifies options to store the results of each iteration, and runs the algorithm using ForwardDiff.jl to compute gradients. It then extracts the initial state and disturbance trajectory from each iteration and performs a rollout of the system. Finally, it filters the resulting trajectories to return failure trajectories. It is important that we specify a nonzero smoothness so that the objective is the smoothed robustness and the gradients are well-defined. L-BFGS converges to a failure trajectory after a single iteration.

[Plot: pendulum angle θ over time for the initial trajectory (green) and the failure trajectory discovered after one iteration (red).]


Local descent methods often get stuck in local optima. Population methods attempt to overcome this drawback by performing optimization using a collection of design points. The points in a population are sometimes referred to as individuals. Population methods begin with an initial population that is spread out over the design space. At each iteration, they use the current function value of each individual to move the population toward the optimum. Because population methods spread samples over the entire design space rather than incrementally improving a single point, they may find a more diverse set of failures. For example, the population method in figure 4.5 is able to find failures for the pendulum in both directions. High-dimensional problems with long time horizons may require a large number of samples to cover the design space. However, population methods are often easy to parallelize, which can improve efficiency.
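As an illustration (a minimal cross-entropy-style sketch, not an implementation from the text; the population size, elite count, and iteration budget are arbitrary choices), a population optimizer compatible with algorithm 4.11 might look like:

using Statistics

function cross_entropy(f, sys, ψ; n=42, m=100, m_elite=10, k_max=20)
    μ, σ² = zeros(n), ones(n)
    failures = []
    for k in 1:k_max
        X = [μ .+ sqrt.(σ²) .* randn(n) for i in 1:m] # sample population
        ys = [f(x) for x in X]                        # evaluate objective
        for x in X                                    # record any failures
            τ = rollout(sys, extract(sys.env, x)...)
            isfailure(ψ, τ) && push!(failures, τ)
        end
        elite = X[sortperm(ys)[1:m_elite]]            # keep best individuals
        μ = mean(elite)                               # refit sampling distribution
        σ² = mean([(x .- μ).^2 for x in elite]) .+ 1e-6
    end
    return failures
end

alg = OptimizationBasedFalsification(robustness_objective, cross_entropy)

Each iteration refits a diagonal Gaussian to the best individuals, gradually concentrating the population near low-objective regions while still sampling broadly early on.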

[Figure 4.5: Population method applied to falsify the inverted pendulum example. The plots show the trajectories for the individuals in the population at iterations 1, 5, and 10. Failures are highlighted in red. The individuals in the population get closer to a failure with each iteration, and the algorithm finds trajectories that fail in both the positive and negative direction.]

4.7 Summary

• Monte Carlo falsification requires 1/p_fail samples on average to find a failure, which can be computationally expensive for systems with rare failure events.

• Optimization-based falsification algorithms use optimization techniques to find failures more efficiently.

• Disturbances are a useful concept for optimization-based falsification algorithms, and we can reformulate the distribution over trajectories for a system as a distribution over initial states and disturbances.


• We can formulate the falsification problem as an optimization problem by defining an objective function and optimizing over initial states and disturbances.

• We can apply a variety of optimization algorithms to search for failures of a system, and the choice of algorithm depends on the problem complexity and the availability of the system's internal model.

5 Falsification through Planning

The methods in the previous chapter find counterexamples by performing optimization over full trajectories. In many cases, we can increase efficiency by considering a sequence of partial trajectories. In particular, this chapter discusses methods that use planning algorithms to account for the temporal aspect of the problem. Planning techniques break the falsification problem into a sequence of smaller problems. We discuss several categories of planning algorithms that rely on optimization, search, and reinforcement learning.

5.1 Shooting Methods

Shooting methods use optimization to find a feasible path between two points,¹ and they can be used in the context of falsification to produce feasible failure trajectories. These methods break the trajectory optimization problem into a set of smaller problems by optimizing over a sequence of trajectory segments. A trajectory τ can be partitioned into n segments such that τ = (τ₁, …, τₙ). Each trajectory segment τᵢ is defined by an initial state sᵢ and a sequence of disturbances 𝐱ᵢ of length dᵢ. Given sᵢ and 𝐱ᵢ, we can compute the resulting trajectory τᵢ by performing a rollout.

¹ The term shooting method is based on the analogy of shooting at a target from a cannon. Shooting methods start at an initial point and "shoot" trajectories toward a target point until a feasible path between the initial point and target is found. Shooting methods originated from research on boundary value problems. A more detailed review with an implementation can be found in section 18.1 of the reference by W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.

The defect between two trajectory segments is the distance between the final state of the first segment and the initial state of the second segment. A set of trajectory segments forms a feasible trajectory if the defect of all consecutive trajectory segments is 0. In other words, the final state of τᵢ must match the initial state of τᵢ₊₁ for all i ∈ {1, …, n − 1}. This requirement leads to the following optimization problem:

    minimize_{s₁,𝐱₁,…,sₙ,𝐱ₙ}  f(τ₁, …, τₙ)
    subject to  τᵢ = Rollout(sᵢ, 𝐱ᵢ)  for all i ∈ {1, …, n}     (5.1)
                Defect(τᵢ, τᵢ₊₁) = 0  for all i ∈ {1, …, n − 1}


where f is the falsification objective (see section 4.5). If n = 1, the optimization problem is equivalent to optimizing over the entire trajectory, and equation (5.1) reduces to equation (4.5). This process is referred to as single shooting. For n > 1, this process is referred to as multiple shooting.

Multiple shooting seemingly increases the complexity of the optimization problem by adding more variables and constraints, but it can actually improve efficiency, especially for systems in which small changes in the inputs applied at the beginning of a trajectory have a significant effect on the end of the trajectory. For example, consider the problem of finding a path through a maze where one wrong turn at the beginning of the trajectory could ultimately lead to a dead end. If we use single shooting, we must optimize over the entire path at once. If we use multiple shooting, we can break the path into segments that focus on different regions of the maze.
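To make the segment representation concrete (a sketch; the segment count, segment length, two-dimensional state, and the placement of the disturbance in the environment slot are illustrative assumptions), an extract function for multiple shooting might split the design vector into per-segment initial states and disturbance sequences:

# hypothetical extract for multiple shooting on a system with 2D states
# and 2D environment disturbances: each segment contributes 2 entries for
# its initial state followed by 2d entries for d disturbances
function extract_segments(env, x; n=4, d=5)
    len = 2 + 2d
    segments = []
    for i in 1:n
        seg = x[(i-1)*len+1 : i*len]
        s = seg[1:2]
        𝐱 = [Disturbance(0, seg[j:j+1], 0) for j in 3:2:len]
        push!(segments, (; s, 𝐱))
    end
    return segments
end

Each segment is a named tuple with fields s and 𝐱, which matches the form expected by the shooting objective in algorithm 5.1 below.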

Algorithm 5.1. Temporal logic robustness objective for multiple shooting. The function takes in a vector of real values x, a system sys, and a specification ψ and returns the objective in equation (5.2). The smoothness parameter controls the smoothness of the robustness function, and λ controls the weighting of the defect penalty. The defect function computes the defect between two trajectory segments. The extract function extracts the trajectory segments from the vector x and is system specific.

defect(τᵢ, τᵢ₊₁) = norm(τᵢ₊₁[1].s - τᵢ[end].s)

function shooting_robustness(x, sys, ψ; smoothness=0.0, λ=1.0)
    segments = extract(sys.env, x)
    n = length(segments)
    τ_segments = [rollout(sys, seg.s, seg.𝐱) for seg in segments]
    τ = vcat(τ_segments...)
    𝐬 = [step.s for step in τ]
    ρ = smooth_robustness(𝐬, ψ.formula, w=smoothness)
    defects = [defect(τ_segments[i], τ_segments[i+1]) for i in 1:n-1]
    return ρ + λ*sum(defects)
end
The constraint on the defect of consecutive trajectory segments in equation (5.1) poses challenges for many optimization algorithms. In practice, we instead incorporate it as a soft constraint by adding it as a penalty in the objective:

    minimize_{s₁,𝐱₁,…,sₙ,𝐱ₙ}  f(τ₁, …, τₙ) + λ ∑_{i=1}^{n−1} Defect(τᵢ, τᵢ₊₁)     (5.2)
    subject to  τᵢ = Rollout(sᵢ, 𝐱ᵢ)  for all i ∈ {1, …, n}


[Figure 5.1: Multiple shooting applied to the continuum world example to find a path from an initial point to the obstacle, shown at iterations 2, 4, 6, and 8 and at convergence. We use four trajectory segments, and the colors denote which segment end points should connect. The plots show the trajectory segments at different iterations of the L-BFGS optimization algorithm.]

where λ is a weighting parameter that controls how heavily the defect is penalized. Algorithm 5.1 implements this objective when f is the temporal logic robustness.

We can apply any of the optimization algorithms discussed in appendix C to the optimization problem in equation (5.2). Compared to the optimization problems in the previous chapter, minimizing the defect between the trajectory segments adds complexity to the problem. This added complexity can make it more difficult to find a feasible failure trajectory. Figure 5.1 shows an example that uses a gradient-based optimization technique called L-BFGS² to find a failure trajectory for the continuum world problem. For systems with black-box simulators, the direct methods described in appendix C may struggle to find feasible failure trajectories. Instead, we can use direct methods that were designed specifically for multiple shooting.³

² J. Nocedal, "Updating Quasi-Newton Matrices with Limited Storage," Mathematics of Computation, vol. 35, no. 151, pp. 773–782, 1980.
³ For an example of a multiple shooting algorithm designed for systems with black-box simulators, see A. Zutshi, J. V. Deshmukh, S. Sankaranarayanan, and J. Kapinski, "Multiple Shooting, CEGAR-Based Falsification for Hybrid Systems," in International Conference on Embedded Software, 2014.

5.2 Tree Search

Tree search algorithms iteratively construct a tree structure that represents the space of possible trajectories. Each node in the tree represents a state, and each edge represents a transition between states that is the result of applying a particular disturbance. Each path through the tree corresponds to a feasible trajectory for the system. Tree search algorithms start in an initial state and iteratively grow the tree in an attempt to find feasible failure trajectories.
the tree in an attempt to find feasible failure trajectories.
At each iteration, these algorithms perform the steps illustrated in figure 5.2.
They first select a node from the tree to extend. This selection is typically based
on a heuristic designed to grow the tree toward failures. Next, they extend the
selected node by choosing a disturbance and adding a new child node at the
resulting next state. We can terminate the algorithm after a fixed number of
iterations or when a failure trajectory is discovered.


Algorithm 5.2 implements the generic tree search algorithm. It runs for a fixed
number of iterations before returning all failures in the tree. Algorithm 5.3 extracts
failure trajectories from a tree by enumerating all paths in the tree and checking
for failures. Specific implementations of tree search algorithms differ in how they
implement the select and extend functions. We discuss two categories of tree
search algorithms in the next two sections.

[Figure 5.2: One iteration of tree search. The nodes of the tree represent states s, and the edges represent transitions labeled (o, a, x): given a disturbance x, we produce an observation o and an action a that lead us to the next node. The algorithm first selects a node from the tree to extend. It then chooses a disturbance to extend the selected node and adds the resulting next state as a new child.]

Algorithm 5.2. Generic tree search algorithm for finding failure trajectories. The algorithm starts by initializing a tree with a single node. It then iteratively selects a node from the tree using the select function and adds to its children using the extend! function. After k_max iterations, the algorithm returns the set of failure trajectories in the tree.

abstract type TreeSearch end

function falsify(alg::TreeSearch, sys, ψ)
    tree = initialize_tree(alg, sys)
    for i in 1:alg.k_max
        node = select(alg, sys, ψ, tree)
        extend!(alg, sys, ψ, tree, node)
    end
    return failures(tree, sys, ψ)
end

5.3 Heuristic Search

Some tree search algorithms use heuristics to explore the space of possible trajectories. The rapidly exploring random trees (RRT) algorithm, for example, uses heuristics to iteratively extend the search tree toward randomly selected states in


Algorithm 5.3. Functions for extracting failure trajectories from a tree. The failures function first finds the leaves of the tree and extracts the corresponding trajectory from each leaf using the trajectory function. The trajectory function starts at a leaf node and propagates backward through the tree to construct the full trajectory. The failures function then filters these trajectories for failures and returns the result.

function trajectory(node)
    τ = []
    while !isnothing(node.parent)
        pushfirst!(τ, (s=node.parent.state, node.edge...))
        node = node.parent
    end
    return τ
end

function failures(tree, sys, ψ)
    leaves = filter(node -> isempty(node.children), tree)
    τs = [trajectory(node) for node in leaves]
    return filter(τ -> isfailure(ψ, τ), τs)
end

the state space.⁴ In the context of falsification, we use RRT to efficiently explore the space of possible disturbance trajectories in search of a failure trajectory.

⁴ RRT was designed to efficiently enumerate trajectories in high-dimensional spaces, particularly for systems with complex dynamics. The algorithm was originally proposed in the context of robotic path planning. For more information on path planning algorithms, see S. LaValle, Planning Algorithms. Cambridge University Press, 2006.

Algorithm 5.4 implements the select and extend steps for the RRT algorithm. In the select step, RRT randomly samples a goal state and computes an objective value for each node in the current tree based on the sampled goal state. This objective is typically related to the distance between each node and the goal state. The algorithm then selects the node with the lowest objective value to pass to the extend step. In the extend step, RRT selects a disturbance, simulates one step forward in time from the selected node, and adds the resulting edge and child node to the tree.

Several variants of RRT differ in how they sample goal states, compute objectives, and select disturbances. Algorithm 5.5 implements a version of the RRT algorithm that samples goal states uniformly from the state space. It then uses the Euclidean distance between each node and the goal state as the objective. In the extend step, the disturbance is randomly sampled from the nominal disturbance distribution for the system. Example 5.1 applies this algorithm to the continuum world problem.

5.3.1 Goal Heuristics


Algorithm 5.5 uses the sampled goal state to select which node in the tree to
extend, but it does not use the goal state when selecting the disturbance used to
extend the selected node. We can improve the performance of the algorithm by


Algorithm 5.4. The rapidly exploring random trees algorithm. The algorithm is a type of tree search algorithm and implements both the select and extend! functions. The select function samples a goal state according to the sample_goal function and computes an objective value for each node in the tree using the compute_objectives function. It then selects the node with the lowest objective value, sets its goal state, and returns it. The extend! function selects a disturbance according to the select_disturbance function, simulates one step forward in time, and adds the results to the tree in the form of a new child node.

struct RRT <: TreeSearch
    sample_goal # sgoal = sample_goal(tree)
    compute_objectives # objectives = compute_objectives(tree, sgoal)
    select_disturbance # x = select_disturbance(sys, node)
    k_max # number of iterations
end

mutable struct RRTNode
    state # node state
    parent # parent node
    edge # (o, a, x)
    children # vector of child nodes
    goal_state # current goal state
end

function initialize_tree(alg::RRT, sys)
    return [RRTNode(rand(Ps(sys.env)), nothing, nothing, [], nothing)]
end

function select(alg::RRT, sys, ψ, tree)
    sgoal = alg.sample_goal(tree)
    objectives = alg.compute_objectives(tree, sgoal)
    node = tree[argmin(objectives)]
    node.goal_state = sgoal
    return node
end

function extend!(alg::RRT, sys, ψ, tree, node)
    x = alg.select_disturbance(sys, node)
    o, a, s′ = step(sys, node.state, x)
    snew = RRTNode(s′, node, (; o, a, x), [], nothing)
    push!(node.children, snew)
    push!(tree, snew)
end

Algorithm 5.5. Functions for the RRT algorithm. The first function samples a goal state uniformly from the state space. The lo and hi inputs specify the lower and upper bounds of the state variables. The second function computes the Euclidean distance between each node in the tree and the goal state. The third function samples a disturbance from the nominal disturbance distribution for the system.

random_goal(tree, lo, hi) = rand.(Distributions.Uniform.(lo, hi))

function distance_objectives(tree, sgoal)
    return [norm(sgoal .- node.state) for node in tree]
end

function random_disturbance(sys, node)
    D = DisturbanceDistribution(sys)
    o, a, s′, x = step(sys, node.state, D)
    return x
end


Example 5.1. Basic RRT applied to the continuum world example. The plots show snapshots of the search tree after 5, 15, and 100 iterations. The stars show the next goal state, and highlighted nodes show the node selected to extend next.

Suppose we want to apply RRT to search for failures for the continuum world system. We can use the following code to run the basic RRT algorithm for 100 iterations:

select_goal(tree) = random_goal(tree, [0.0, 0.0], [10.0, 10.0])
compute_objectives(tree, sgoal) = distance_objectives(tree, sgoal)
select_disturbance(sys, node) = random_disturbance(sys, node)
alg = RRT(select_goal, compute_objectives, select_disturbance, 100)
failures = falsify(alg, cw, ψ)

The plots below show two snapshots of the search tree after 5 and 15 iterations as well as the final tree after 100 iterations. After 100 iterations, RRT did not find any failure trajectories. Although goal states are sampled throughout the state space, the disturbances are sampled from the nominal disturbance distribution. Since the nominal disturbance distribution represents only small deviations from the nominal path, the tree closely follows the nominal path toward the goal. We can improve the performance of the tree search using the heuristics discussed in section 5.3.1.

[Plots: search tree snapshots at iterations 5, 15, and 100.]


using the goal state to select the disturbance. Specifically, we want to select the
disturbance that leads to the next state that is closest to the goal state.

Algorithm 5.6. Function for selecting a disturbance that leads to the next state that is closest to the goal state. The algorithm takes m steps using the nominal disturbance distribution. It then computes the distances between the next state from each step and the goal state. It returns the disturbance that resulted in the lowest distance.

function goal_disturbance(sys, node; m=10)
    D = DisturbanceDistribution(sys)
    steps = [step(sys, node.state, D) for i in 1:m]
    distances = [norm(node.goal_state - step.s′) for step in steps]
    return steps[argmin(distances)].x
end

Algorithm 5.6 uses sampling to search for a disturbance that results in a next state that is close to the goal state. It draws m samples from the nominal disturbance distribution and simulates one step forward in time from the current node using each sample. It then returns the disturbance that results in the next state that is closest to the goal state. As m increases, the performance of the algorithm improves but at a greater computational cost.⁵ Figure 5.3 demonstrates this process on one step of RRT for the continuum world problem.

⁵ To improve performance and efficiency, more sophisticated optimization algorithms can also be used (see appendix C).

[Figure 5.3: One iteration of RRT applied to the continuum world example using algorithm 5.6 with m = 10 to select the disturbance in the extend step. The algorithm selects the node that is closest to the goal state to extend and samples 10 disturbances to add to the tree. It then selects the disturbance that results in the next state that is closest to the goal state and adds the resulting edge and child node to the tree.]

In addition to improving the extend step of algorithm 5.5, we can improve the select step by modifying how we sample the goal state. Instead of sampling the goal state uniformly from the state space, we can use a heuristic to sample goal states that are more likely to grow the tree toward failures. One technique is to identify a failure region in the state space and sample goal states from this region. A failure region is a region such that any trajectory that passes through this region is a failure trajectory. For example, the failure region in the continuum world problem is the set of states within the red obstacle. This technique is limited to systems with specifications that depend only on the state. For specifications of temporal properties, identifying a failure region is not possible without augmenting the state space. Figure 5.4 shows the result of using this heuristic along with algorithm 5.6 to apply RRT to the continuum world problem.

[Figure 5.4: RRT applied to the continuum world problem using algorithm 5.6 with m = 10 to select the disturbance in the extend step and goal states sampled from the failure region. The algorithm was run for 100 iterations and discovered the failure trajectories highlighted in red.]

Example 5.2. Example of sampling goal states from the failure region of the continuum world problem.

The failure region for the continuum world example is the set of states within the red obstacle, which is a circle centered at (4.5, 4.5) with radius 0.5. We can sample from this region using the following code:

function failure_goal(tree)
    r = rand(Uniform(0, 0.5))
    θ = rand(Uniform(0, 2π))
    return [4.5, 4.5] .+ [r*cos(θ), r*sin(θ)]
end

The code uniformly samples a radius between 0 and 0.5 and an angle between 0 and 2π. It then converts these samples to a state in the failure region.

5.3.2 Coverage Heuristics

To uncover a variety of ways in which a system might fail, it is important that we explore a diverse set of trajectories. We incorporate this idea into RRT using heuristics that are designed to maximize coverage of the state space. We assess coverage using coverage metrics, which measure how well a set of samples fills a given space. In the context of tree search, we are interested in how well the states represented by the nodes in the tree fill the state space. We can then use these coverage metrics in the select step to select the next goal state.
One common coverage metric is related to the concept of dispersion. The dispersion of a set of points V in the bounded region S is the radius of the largest ball that can be placed in S such that no point in V lies within the ball, written as

    dispersion = sup_{s ∈ S} min_{sᵢ ∈ V} ‖s − sᵢ‖     (5.3)

where the outer optimization represents the supremum. A supremum is a generalization of a maximum that allows solutions to exist when the largest ball merely approaches a particular size before containing one of the points in V. The norm in equation (5.3) can be any norm. A common choice is the ℓ₂-norm. Coverage is inversely related to dispersion. In other words, a set of points with high dispersion will have low coverage (see figure 5.5).

[Figure 5.5: Visualization of dispersion for two different sets of 10 points. The blue set does not fill the space as well as the green set. We can find a larger ball that does not contain any points in the blue set than we can for the green set. Therefore, the blue set has higher dispersion and lower coverage.]


Since dispersion considers only the largest ball that can be placed in S, it tends to be a conservative measure of coverage. Furthermore, it is difficult to compute for high-dimensional spaces. An approximate metric called average dispersion overcomes these drawbacks.⁶ Average dispersion is computed on a grid of n points with spacing δ in each dimension. It is calculated as

    average dispersion = (1/n) ∑_{j=1}^{n} min(d_j(V), δ) / δ     (5.4)

where d_j(V) is the distance from the jth grid point to the nearest point in V.

⁶ J. M. Esposito, J. Kim, and V. Kumar, "Adaptive RRTs for Validating Hybrid Robotic Control Systems," in Algorithmic Foundations of Robotics, Springer, 2005, pp. 107–121.

Algorithm 5.7. Algorithm for computing the average dispersion of a set of points on a space bounded by lo and hi. It uses a grid specified by lengths, which contains the number of grid points in each dimension. The algorithm first normalizes the points to lie in the unit hypercube. It then creates the grid over the unit hypercube and computes the average dispersion using equation (5.4).

function average_dispersion(points, lo, hi, lengths)
    points_norm = [(point .- lo) ./ (hi .- lo) for point in points]
    ranges = [range(0, 1, length) for length in lengths]
    δ = minimum(Float64(r.step) for r in ranges)
    grid_dispersions = []
    for grid_point in Iterators.product(ranges...)
        dmin = minimum(norm(grid_point .- p) for p in points_norm)
        push!(grid_dispersions, min(dmin, δ) / δ)
    end
    return mean(grid_dispersions)
end
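For instance (a small usage sketch; the random point set and grid resolution are arbitrary choices), we can turn average dispersion into a coverage score for points in the unit square:

using LinearAlgebra, Statistics

points = [rand(2) for i in 1:10]
ad = average_dispersion(points, [0.0, 0.0], [1.0, 1.0], [20, 20])
coverage = 1 - ad # coverage metric between 0 and 1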
equation (5.4).

Algorithm 5.7 computes average dispersion given a set of points and a bounded region. The term in the numerator of equation (5.4) is the radius of the largest ball centered at each grid point that does not contain any points in V or other grid points. Dividing by δ ensures that the values for average dispersion range between 0 and 1, and subtracting the average dispersion from 1 results in a coverage metric that ranges between 0 and 1.⁷ Figure 5.6 shows the difference between dispersion and average dispersion.

⁷ The average dispersion coverage metric will be 1 if V contains all of the grid points. A finer grid will result in better coverage estimates but at a greater computational cost.

Another common coverage metric is discrepancy. The key insight behind discrepancy is that if a set of points covers a space evenly, then a randomly chosen subset of the space should contain a fraction of samples proportional to the fraction of volume occupied by the subset. Discrepancy is defined in terms of the worst-case hyperrectangular subset:

    discrepancy = sup_{H ⊆ S} | #(V ∩ H)/#(V) − vol(H)/vol(S) |     (5.5)


[Figure 5.6: Visualization of the difference between dispersion and average dispersion on a set of points. While dispersion finds the largest ball that does not contain any points in the set, average dispersion operates on a grid. The gray dots indicate grid points, and the circles show the largest ball centered at each grid point that does not contain any grid points or points in the set. The average dispersion is the average of the radii of these circles normalized by the grid spacing.]

where H is a hyperrectangular subset of S, and #(V ∩ H) and #(V) are the number of points in V that lie in H and the total number of points in V, respectively. We use vol(H) and vol(S) to denote the n-dimensional volumes of H and S, respectively, which can be obtained by multiplying the side lengths.

The worst-case hyperrectangle that determines the discrepancy of a set of points is typically a small region containing many points or a large region with few points. Figure 5.7 visualizes the discrepancy metric. Discrepancy approaches 1 when all points overlap and approaches 0 when all possible hyperrectangular subsets have their proper share of points. In general, discrepancy is difficult to compute exactly, especially in high dimensions.

[Figure 5.7: Visualization of the discrepancy metric. The rectangles indicate two candidates for the worst-case rectangle used to define discrepancy. Discrepancy is determined by a rectangle with small area and many points (top) or a rectangle with large area and few points (bottom).]

Star discrepancy is a special case of discrepancy that is easier to compute and is often used in practice. Instead of considering all possible hyperrectangular subsets, star discrepancy considers only hyperrectangular subsets of the unit hypercube that have a vertex at the origin. We can always normalize any hyperrectangular space S to the unit hypercube by dividing by the side length in each dimension. Given these constraints, it is possible to compute lower and upper bounds on star discrepancy.⁸

⁸ E. Thiémard, "An Algorithm to Compute Bounds for the Star Discrepancy," Journal of Complexity, vol. 17, no. 4, pp. 850–880, 2001. Examples of other approximations can be found in Y.-D. Zhou, K.-T. Fang, and J.-H. Ning, "Mixture Discrepancy for Quasi-Random Point Sets," Journal of Complexity, vol. 29, no. 3-4, pp. 283–301, 2013.

We first partition the unit hypercube B into a finite number of subrectangles h ∈ P. We then compute the bounds as

    upper = max_{h ∈ P} max( #(V ∩ h⁺)/#(V) − vol(h⁻)/vol(B),  vol(h⁺)/vol(B) − #(V ∩ h⁻)/#(V) )
    lower = max_{h ∈ P} max( |#(V ∩ h⁻)/#(V) − vol(h⁻)/vol(B)|,  |#(V ∩ h⁺)/#(V) − vol(h⁺)/vol(B)| )     (5.6)

where h⁺ and h⁻ are hyperrectangular subsets derived from subrectangle h as shown in figure 5.8.

[Figure 5.8: Visualization of the hyperrectangular subsets h⁻ and h⁺ derived from subrectangle h and used to compute upper and lower bounds on star discrepancy.]
The tightness of the upper and lower bounds in equation (5.6) depends on the resolution of the partition. Finer partitions will lead to tighter bounds at a greater computational cost. Algorithm 5.8 computes upper and lower bounds on star discrepancy using equation (5.6) given a set of points and a bounded region. We can subtract the value of star discrepancy from 1 to provide a coverage metric that ranges between 0 and 1. Figure 5.9 shows the upper and lower bounds on star discrepancy for the sets of points in figure 5.5 as the resolution of the partition is increased.

[Figure 5.9: Upper and lower bounds on star discrepancy for the sets of points in figure 5.5, plotted against grid resolution. The grid resolution is the number of grid points in each dimension. The upper and lower bounds approach each other as the grid resolution increases. The green set of points is more evenly distributed than the blue set, so it has lower star discrepancy.]

We can use average dispersion or star discrepancy as a metric to select the goal state in RRT. In particular, we want to select the goal state that would result in the greatest increase in coverage if added to the current tree. While it is difficult to determine this goal state exactly, we can approximate this process by drawing samples from the state space, computing the difference in coverage for each sample, and selecting the sample with the largest increase.⁹ The samples may be selected from a grid (figure 5.10) or drawn uniformly from the state space (example 5.3).

⁹ High-dimensional problems may require more sophisticated techniques. T. Dang and T. Nahhal, "Coverage-Guided Test Generation for Continuous and Hybrid Systems," Formal Methods in System Design, vol. 34, pp. 183–213, 2009.

Coverage metrics can also be used as termination conditions. It is not always clear when to terminate tree search algorithms, especially if no failures are found. One option is to terminate the search when state space coverage is sufficient. Since not all states are necessarily reachable, coverage will not necessarily approach 1 as the number of tree search iterations increases. We therefore cannot use the magnitude of coverage as a termination condition by itself. Instead, we compute a growth metric such as

    growth = (Coverage(V′) − Coverage(V)) / (#(V′) − #(V))     (5.7)

where V and V′ are the sets of points in the tree at the beginning and end of the current iteration.
© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
5.3. heuristic search 77

function star_discrepancy(points, lo, hi, lengths) Algorithm 5.8. Algorithm for com-
n, dim = length(points), length(lo) puting upper and lower bounds
𝒱 = [(point .- lo) ./ (hi .- lo) for point in points] on the star discrepancy of a set
ranges = [range(0, 1, length)[1:end-1] for length in lengths] of points on a space bounded
steps = [Float64(r.step) for r in ranges] by lo and hi. It uses a partition
ℬ = Hyperrectangle(low=zeros(dim), high=ones(dim)) specified by lengths, which con-
lbs, ubs = [], [] tains the number of subrectangles
for grid_point in Iterators.product(ranges...) in each dimension. The algorithm
h⁻ = Hyperrectangle(low=zeros(dim), high=[grid_point...]) first normalizes the points to lie in
h⁺ = Hyperrectangle(low=zeros(dim), high=grid_point .+ steps) the unit hypercube. It then creates
𝒱h⁻ = length(filter(v -> v ∈ h⁻, 𝒱)) the partition over the unit hyper-
𝒱h⁺ = length(filter(v -> v ∈ h⁺, 𝒱)) cube and computes the upper and
push!(lbs, max(abs(𝒱h⁻ / n - volume(h⁻) / volume(ℬ)), lower bounds on star discrepancy
abs(𝒱h⁺ / n - volume(h⁺) / volume(ℬ)))) using equation (5.6). We use the
push!(ubs, max(𝒱h⁺ / n - volume(h⁻) / volume(ℬ), LazySets.jl package to represent
volume(h⁺) / volume(ℬ) - 𝒱h⁻ / n)) hyperrectangles.
end
return maximum(lbs), maximum(ubs)
end

Average Dispersion Star Discrepancy Figure 5.10. Selecting the next goal
state for RRT applied to the con-
tinuum world problem using av-
erage dispersion and star discrep-
ancy coverage metrics. The plots
show the grid points used as can-
didates for the next goal state. The
color of each grid point indicates
the increase in coverage that would
result from adding that grid point
to the tree with darker colors indi-
cating a greater increase. For star
discepancy, the colors represent
the lower bound. The star indicates
the goal state selected by RRT. Be-
cause star discrepancy only focuses
on the worst-case hyperrectangle,
it is not as smooth as average dis-
persion.

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
78 c hap ter 5. falsification throug h planning

Suppose we want to apply coverage heuristics when using RRT on the con- Example 5.3. RRT applied to the
continuum world problem using
tinuum world problem. The following code implements a version of the coverage heuristics. The plots illus-
select_goal function that uses coverage based on average dispersion to trate the effect of selecting the next
guide the search. goal state based on coverage rather
function select_goal(tree; m=5) than randomly selecting it.
a, b, lengths = [0, 0], [10, 10], [10, 10]
points = [node.state for node in tree]
sgoals = [rand.(Distributions.Uniform.(a, b)) for _ in 1:m]
dispersions = [average_dispersion([points..., sgoal], a, b, lengths)
for sgoal in sgoals]
coverages = 1 .- dispersions
return sgoals[argmax(coverages)]
end

We first collect the states visited so far from the nodes of the tree and sample
m potential goal states uniformly from the state space. We then compute the
new average dispersion if each goal state were added to the tree. The goal
state that results in the greatest increase in coverage is selected. The plots
show the resulting trees when using random goals and coverage-based goals.
Using the coverage-based goal selection results in a wider tree that covers
more of the state space.

Random Goal Coverage Goal

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
5.3. heuristic search 79

We can terminate the tree search when the growth metric is sufficiently small.
It is important to note that this growth metric does not provide any guarantees
about the coverage of the search tree. Even when growth is small, there may still be

Average Dispersion Coverage


1
unexplored regions of the state space that are reachable under rare circumstances.
0.8
For example, every state in the continuum world problem is reachable through a
sequence of disturbances, but the average dispersion coverage metric plateaus at 0.6

a number less than 1 (see figure 5.11). Some states are extremely unlikely to be 0.4
reached under the nominal disturbance model. 0.2

0
5.3.3 Alternative Objectives 100 200 300
Iteration
As noted in the previous chapter, we may want to go beyond a simple search for
Figure 5.11. The average disper-
failures and incorporate other objectives into the search process. For example, we sion coverage metric over iterations
may be interested in finding the shortest path to failure or the most likely failure. of RRT applied to the continuum
world problem.
We can incorporate these objectives into RRT by modifying how we compute the
objectives in the select step (algorithm 5.9).
First, we define a cost function c that maps a node to a cost of transitioning
to the node from its parent. For example, the cost might be a measure of the
distance between the node’s state and its parent’s state. To ensure that the tree
search algorithm is still encouraged to reach the goal, all costs must be positive.
The total cost of a path is the sum of the costs of all nodes in the path. Our goal is
to find the path to the goal with the lowest total cost.
We compute an objective for each node consisting of two components: the
total cost of the current path from the root to the node and an estimate of the
remaining cost to get from the node to the goal state. The remaining cost estimate 10
This algorithm is a simplified
comes from a heuristic function h. One potential heuristic is the distance from the version of the RRT⇤ algorithm.
S. Karaman and E. Frazzoli, “In-
current node to the goal state. Algorithm 5.9 implements this process given a cost cremental Sampling-Based Algo-
function and heuristic function.10 It provides default cost and heuristic functions rithms for Optimal Motion Plan-
ning,” Robotics Science and Systems
that will guide the search toward the shortest path. VI, vol. 104, no. 2, pp. 267–274,
To search for the most likely failure, we can use a cost function related to the 2010.
negative log likelihood of the disturbance for the current node. We add a constant
factor of the maximum possible log likelihood according to the disturbance dis-
tribution to ensure that the cost is positive. For the heuristic function, we need to
estimate the log likelihood of the remaining path required to reach the goal state.
One option is to use the distance to the goal state as a proxy for this value since
longer paths tend to result in lower negative log likelihoods. Adding a scaling

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
80 c h ap ter 5. falsification through planning

distance_c(node) = norm(node.parent.state .- node.state) Algorithm 5.9. Algorithm for com-


distance_h(node, sgoal) = norm(sgoal .- node.state) puting objectives based on a cost
function c and heuristic function
function cost_objectives(tree, sgoal; c=distance_c, h=distance_h) h. The algorithm first traverses the
costs = Dict() tree and accumulates the total cost
queue = [tree[1]] of each node. It then computes the
while !isempty(queue) heuristic for each node and adds it
node = popfirst!(queue) to the total cost to get the objective
if isnothing(node.parent) values. We supply default imple-
costs[node] = 0.0 mentations of the cost and heuristic
else functions that will encourage RRT
costs[node] = c(node) + costs[node.parent] to search for the shortest path.
end
for child in node.children
push!(queue, child)
end
end
heuristics = [h(sgoal, node) for node in tree] Nominal Path
objectives = [costs[node] for node in tree] .+ heuristics
return objectives Most Likely
end Failure

Shortest
Failure
factor to the cost function to balance between the heuristic and cost may improve
performance. Figure 5.12 shows the results from using RRT to find the shortest
path to failure and most likely failure for the continuum world problem. Figure 5.12. The nominal path
for the continuum world problem
While algorithm 5.9 will often find a low cost path to failure, it is not necessarily compared to the shortest path to
guaranteed find the path with the lowest possible cost. Certain conditions on failure and the most likely failure
path found by RRT. The most likely
the nature of the problem and the heuristic function are required to guarantee failure path stays closer to the nom-
optimality. Algorithm 5.9 will converge to the optimal path if the state space inal path before moving toward the
and disturbance space are discrete and the heuristic function is admissible.11 A obstacle.
11
When these conditions are met,
heuristic is admissible if it is guaranteed to never overestimate the cost of reaching the algorithm is the same as the
the goal state. In shortest path problems, the straight-line distance to the goal A⇤ search algorithm. P. E. Hart, N. J.
Nilsson, and B. Raphael, “A For-
state is an admissible heuristic. Example 5.4 demonstrates this result on the grid
mal Basis for the Heuristic Deter-
world problem. mination of Minimum Cost Paths,”
IEEE Transactions on Systems Sci-
ence and Cybernetics, vol. 4, no. 2,
5.4 Monte Carlo Tree Search pp. 100–107, 1968.
For a survey, see C. B. Browne, E.
12

Monte Carlo tree search (MCTS) (algorithm 5.10) is a tree search algorithm that Powley, D. Whitehouse, S. M. Lu-
cas, P. I. Cowling, P. Rohlfshagen, S.
balances between exploration and exploitation.12 It explores by selecting nodes Tavener, D. Perez, S. Samothrakis,
that have not been visited many times and exploits by biasing the search tree and S. Colton, “A Survey of Monte
Carlo Tree Search Methods,” IEEE
toward paths that seem most promising. MCTS determines which paths are most Transactions on Computational Intel-
ligence and AI in Games, vol. 4, no. 1,
pp. 1–43, 2012.
© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
5.4. monte carlo tree search 81

Since the state space and disturbance space for the grid world problem are Example 5.4. Example of using
RRT to find the shortest path to
discrete, we are guaranteed to find the shortest path to failure and the most failure and most likely failure for
likely failure path as long as we select an admissible heuristic function. For the grid world problem. The plots
show the search tree at different it-
the shortest path to failure, an admissible heuristic is the Euclidean distance erations of the algorithm. The most
between the current state and the goal state. This distance will always be less likely failure path stays closer to
than or equal to the actual cost of reaching the goal state since the shortest the nominal path (highlighed in
gray) before moving toward the ob-
path between two points is a straight line. For the most likely failure path, stacle.
we can use the likelihood of a straight line trajectory from the current state to
the goal state assuming that it used the most likely disturbance at each step.
The plots show the results. As in the continuum world problem (figure 5.12),
the most likely failure path stays closer to the nominal path before moving
toward the obstacle.
Iteration 10 Iteration 25 Converged
Shortest Path
Most Likely Path

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
82 c h ap ter 5. falsification throug h planning

promising by maintaining a value function Q(s, x ) for each node in the tree. Given
a failure objective (section 4.5), Q(s, x ) represents the expected future objective
value when applying disturbance x from state s. MCTS searches for the path with
the lowest objective value.13 13
When the objective function is
In the select step, MCTS traverses the tree starting at the root node. At each the most likely failure objective,
this technique is sometimes re-
node, we determine whether to select it for the extend step based on its current ferred to as adaptive stress testing. R.
number of children and number of visits N (s). Specifically, we extend the node if Lee, O. J. Mengshoel, A. Saksena,
R. W. Gardner, D. Genin, J. Silber-
the number of children is less than or equal to kN (s)a , where k and a are algorithm mann, M. Owen, and M. J. Kochen-
hyperparameters. This process is referred to as progressive widening. If the number derfer, “Adaptive Stress Testing:
Finding Likely Failure Events with
of children exceeds this value, we continue to traverse the tree using a heuristic
Reinforcement Learning,” Journal
that balances between exploration and exploitation. of Artificial Intelligence Research,
vol. 69, pp. 1165–1201, 2020.

Iteration 100 Iteration 200 Failure Found Figure 5.13. MCTS applied to find
a failure in the continuum world
problem. Darker nodes and edges
were visited more often. MCTS
finds a failure (highlighted in red)
after 258 iterations.

A common heuristic is the lower confidence bound (LCB) (algorithm 5.11), which
is defined as s
log N (s)
Q(s, x ) c (5.8)
N (s, x )
where N (s, x ) is the number of times we took the path corresponding to distur-
bance x from the node corresponding to state s. The first term in equation (5.8)
exploits our current estimate of how promising a particular path is based on the
value function, and the second term is an exploration bonus. The exploration
constant c controls the amount of exploration. Higher values will lead to more
exploration. We move to the child node with the lowest LCB value and repeat the
process until we reach a node that we can extend.

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
5.4. monte carlo tree search 83

struct MCTS <: TreeSearch Algorithm 5.10. The Monte Carlo


estimate_value # v = estimate_value(sys, ψ, node) tree search algorithm. The algo-
c # exploration constant rithm is a type of tree search al-
k # progressive widening constant gorithm and implements both the
α # progressive widening exponent select and extend! functions. The
select_disturbance # x = select_disturbance(sys, node) select function traverses the tree
k_max # number of iterations using the lcb function as a guide
end until is reaches a node that can
be extended based on its number
mutable struct MCTSNode of children. The extend! function
state # node state samples a disturbance according
parent # parent node to the select_disturbance func-
edge # (o, a, x) tion and simulates the system one
children # vector of child nodes step forward in time from the
N # visit count current node. It then estimates
Q # value estimate
the value at the new node using
end
the estimate_value function and
adds it to the tree. Finally, it propa-
function initialize_tree(alg::MCTS, sys)
gates this information back up the
return [MCTSNode(rand(Ps(sys.env)), nothing, nothing, [], 1, 0)]
tree to update the visit counts and
end
mean value estimate for each node
function select(alg::MCTS, sys, ψ, tree) in the path.
c, k, α, node = alg.c, alg.k, alg.α, tree[1]
while length(node.children) > k * node.N^α
node = lcb(node, c)
end
return node
end

function extend!(alg::MCTS, sys, ψ, tree, node)


x = alg.select_disturbance(sys, node)
o, a, s′ = step(sys, node.state, x)
Q = alg.estimate_value(sys, ψ, s′)
snew = MCTSNode(s′, node, (; o, a, x), [], 1, Q)
push!(node.children, snew)
push!(tree, snew)
while !isnothing(node)
node.N += 1
node.Q += (Q - node.Q) / node.N
Q, node = node.Q, node.parent
end
end

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
84 c h ap ter 5. falsification through planning

function lcb(node::MCTSNode, c) Algorithm 5.11. The lower confi-


Qs = [node.Q for node in node.children] dence bound algorithm. The al-
Ns = [node.N for node in node.children] gorithm computes the LCB for
lcbs = [Q - c*sqrt(log(node.N)/N) for (Q, N) in zip(Qs, Ns)] each child node according to equa-
return node.children[argmin(lcbs)] tion (5.8) and returns the child
end node with the lowest LCB.

In the extend step, MCTS samples a disturbance and simulates the system one
step forward in time from the current node. It then estimates the value at the
new node and adds it to the tree. A common technique to estimate this value is
to perform rollouts from the new node and evaluate their robustness. We can
also estimate the value using a heuristic such as distance to failure. Finally, we
propagate this information back up the tree to update the visit counts and mean
value estimate for each node in the path. Figure 5.13 shows the result of using
MCTS to find failures in the continuum world problem. The algorithm gradually
expands the tree toward the obstacle and visits promising nodes more often.
The tree search algorithms we have presented so far assumed deterministic
transitions between nodes. In other words, simulating disturbance x from state s
will always lead to the same next state s0 . However, we may not have control over
all sources of randomness for some real-world simulators, resulting in stochastic
transitions between nodes. One advantage of MCTS is that it can handle this
stochasticity. A technique called double progressive widening can be used to extend
the tree in these cases. Double progressive widening applies the progressive
widening condition to both the disturbance and next state.14 14
A. Couëtoux, J.-B. Hoock, N.
Sokolovska, O. Teytaud, and N.
Bonnard, “Continuous Upper Con-
5.5 Reinforcement Learning fidence Trees,” in Learning and In-
telligent Optimization (LION), 2011.

Reinforcement learning algorithms train agents to perform a task while they interact
with an environment.15 We can use reinforcement learning for falsification by For an introduction to reinforce-
15

training an agent to cause a system to fail. To avoid confusing the reinforcement ment learning, see R. S. Sutton and
A. G. Barto, Reinforcement Learning:
learning agent with the agent in the system under test, we call the reinforcement An Introduction, Second Edition.
learning agent an adversary. MIT Press, 2018.
Figure 5.14 shows the overall setup. At each time step, the adversary interacts
with the system by selecting a disturbance x. The system then steps forward in
time and produces a reward r for the adversary related to the failure objective.
We refer to a series of these time steps as an episode. Reinforcement learning

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
5.6. simulator requirements 85

algorithms train the adversary to maximize reward using data gathered over a
System
series of episodes. Specifically, the adversary learns a policy padv (s) that maps
states to disturbances. Once the adversary is trained, we can use it to search for
failures by performing rollouts of the system using disturbances selected by the x r
adversary’s policy.
Similar to MCTS, reinforcement learning algorithms balance between explo- Adversary
ration and exploitation. The adversary explores by trying different disturbances
in each state, and exploits by selecting disturbances that are likely to lead to a Figure 5.14. Reinforcement learn-
failure. Typically, the adversary will explore more at the beginning of training to ing for falsification. We train an ad-
versary to select disturbances that
gather data that it can later on exploit. Reinforcement learning algorithms balance will cause a system to fail. The ad-
between these two objectives to maximize sample efficiency. Sample efficient algo- versary receives feedback in the
form of a reward signal.
rithms require as few episodes as possible to learn an effective policy. A number
of sample efficient reinforcement learning algorithms have been developed, and
we can use off-the-shelf implementations of them to efficiently find failures of
complex systems.16 16
Off-the-shelf reinforcement
Another advantage of a reinforcement learning approach is its ability to gener- learning packages provide im-
plementations of a variety of
alize. The shooting methods and tree search algorithms discussed in this chapter reinforcement learning algorithms.
all required a specific initial state from which to find a failure path. Using rein- For example, see the Crux.jl
package in the Julia ecosystem.
forcement learning to find failures removes this necessity. Because the adversary
learns a policy over the entire state space, we can perform a rollout from any
initial state to search for a failure. Example 5.5 demonstrates this result on the
continuum world problem using an off-the-shelf reinforcement learning package.

5.6 Simulator Requirements

Selecting an appropriate falsification algorithm for a given system is often depen-


dent on the capabilities of the system simulator. Some commercial simulators, for
example, do not provide access to their internal models. Simulators also differ in
the aspects of the simulation that the user can control. The falsification algorithms
we discussed in this chapter and the previous chapter impose different require-
ments on the system simulator. Figure 5.15 summarizes these requirements.
To apply any of the algorithms from chapter 4, the simulator must be capable of
performing a rollout. For a black-box rollout, the simulator takes as input an initial
state s and a vector of disturbances x from the user and outputs the objective value
f (t ). For these simulators, we can perform falsification using direct sampling or
fuzzing. We can also use optimization-based falsification algorithms that only rely

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
86 c hap ter 5. falsification through planning

To apply the reinforcement learning algorithms implemented in the Crux.jl Example 5.5. Example of using re-
inforcement learning to find fail-
package to the continuum world problem, we need to define the following: ures in the continuum world prob-
initial_state_dist = Product([Distributions.Uniform(0, 10), lem. The plots show rollouts of the
Distributions.Uniform(0, 10)]) adversary policy starting from dif-
function interact(s, x, rng) ferent initial states after different
_, _, s′ = step(cw, s, Disturbance(0, x, 0)) numbers of training episodes. Fail-
r = Float32(robustness(s, ψ.formula) - robustness(s′, ψ.formula)) ure trajectories are highlighted in
norm(s′ - [4.5, 4.5]) < 0.5 ? r += 10.0 : nothing red. The adversary is able to find
return (sp=s′, r=r) failures from most initial states af-
end
ter 50,000 training episodes. For
more information on the solving
We first define an initial state distribution that covers the entire state space,
code, see the Crux.jl documenta-
allowing us to find failures starting from any state. The interact function tion.
defines how the adversary interacts with the system. Given a state s and a
disturbance x, the function simulates the system one step forward in time
and returns a tuple with the next state s′ and reward r. The random number
generator rng is a required input for Crux.jl but is not used in this case since
the function is deterministic.
The reward is based on the change in robustness for the current step. We
also add a large reward for reaching a failure state. With these definitions, we
can apply any of the reinforcement learning algorithms in the Crux.jl pack-
age to find failures. The plots show rollouts of the adversary policy starting
from different initial states after different numbers of training episodes using
an algorithm called Proximal Policy Optimization (PPO). The adversary is
able to find failures from most initial states after 50,000 training episodes.

5,000 Episodes 30,000 Episodes 50,000 Episodes

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
5.6. simulator requirements 87

Figure 5.15. Overview of simulator


Algorithm Categories Simulator Requirements requirements for the various cat-
egories of falsification algorithms.
The first two rows are related to the
direct sampling Black-Box Rollout
fuzzing
optimization-based falsification al-
s1 , x s2 s3 sd f (t )
direct methods step 1 step 2 ... step d gorithms discussed in chapter 4,
population methods and the second two rows relate to
the planning algorithms discussed
White-Box Rollout
in this chapter. Variables shown in
blue are variables that the user of
first-order methods s1 , x s2 s3 sd f (t )
second-order methods step 1 step 2 ... step d the simulator has control over. We
cannot observe any aspects of the
simulator shown in gray.
Episode
reinforcement learning s1 , x1 s2 , r2 , x2 s3 , r3 , x3 sd , rd , xd
step 1 step 2 ... step d

Extend
tree search
multiple shooting
s, x s0 , c
step

on evaluations of the objective function such as direct methods and population


methods. A white-box rollout has the same inputs and outputs as a black-box
rollout, but it also allows us to observe the internal model of this system. We can
compute gradients and Hessians of the objective function for white-box rollouts,
allowing us to apply first- and second-order optimization methods.
The planning algorithms discussed in this chapter require the simulator to be
able to perform single steps. Reinforcement learning algorithms operate using
episodes, which consist of a series of steps starting from a user-specified initial
state. At each step, the reinforcement learning agent observes the next state and
reward from the previous step and selects a disturbance. Tree search algorithms
and multiple shooting methods require the simulator to be able to take steps from
arbitrary states. Given a state s and a disturbance x, the simulator must be able to
simulate the system one step forward in time and return the next state s0 . For tree
search algorithms that use cost functions, the simulator must also return the cost
c of taking the step.

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
88 c h ap ter 5. falsification through planning

5.7 Summary

• Planning algorithms account for the temporal aspect of the falsification problem
and break it into a series of smaller problems.

• Shooting methods perform optimization-based falsification by optimizing over


a series of trajectory segments, which may increase efficiency for systems where
small changes in the disturbances at the beginning of a trajectory can have a
significant effect later.

• Tree search algorithms search the space of possible trajectories as a tree and
iteratively grow the tree in search of a failure trajectory.

• Heuristic search algorithms use heuristics such as distance to failure, coverage,


and robustness to guide the search.

• Monte Carlo tree search balances between exploration and exploitation to


efficiently search the space of possible trajectories.

• Reinforcement learning algorithms can be used train an adversary to produce


failures in a sample efficient manner.

• The capabilities of a system’s simulator determine which falsification algo-


rithms can be applied.

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
6 Failure Distribution

While the falsification algorithms in the previous chapters search for single failure
events, it is often desirable to understand the distribution over failures for a
given system and specification. This distribution is difficult to quantify exactly for
many real-world systems. Instead, we can approximate the failure distribution by
drawing samples from it. This chapter discusses methods for sampling from the
failure distribution. We present two categories of sampling methods. First, we
discuss rejection sampling, which produces samples from a target distribution
by accepting or rejecting samples from a different distribution. We then present
Markov chain Monte Carlo (MCMC) methods. MCMC methods generate samples / y)
p(t | t 2
from a target distribution using a chain of correlated samples. We conclude with
a discussion of probabilistic programming, which allows us to scale MCMC
p(t )
methods to complex, high-dimensional systems.

6.1 Distribution over Failures


4 2 0 2 4
t

The distribution over failures for a given system with specification y is represented Figure 6.1. The distribution over
failures for a simple system where
by the conditional probability p(t | t 2 / y). We can write this probability as
trajectories consist of only a single
state that is sampled from a normal
1{ t 6 2 y } p ( t ) distribution (black). A failure oc-
/ y) = R
p(t | t 2 (6.1)
1{t 62 y} p(t ) dt curs when the sampled state is less
than 1. The area of the shaded re-
where 1{·} is the indicator function and p(t ) is the probability density of the gion corresponds to the integral in
equation (6.1). The failure distribu-
nominal trajectory distribution for trajectory t. Figure 6.1 shows the failure distri- tion (red) is the probability density
bution for a simple system where trajectories consist of only a single state that is function of the nominal distribu-
sampled from a normal distribution. For most systems, the failure distribution is tion in the failure region scaled by
this value.
difficult to compute exactly because doing so requires solving the integral in the
denominator of equation (6.1) to compute the normalizing constant. The value of
90 c h ap ter 6. failure distribution

this integral corresponds to the probability of failure for the system. We discuss 1
For a detailed overview, see C. P.
Robert and G. Casella, Monte Carlo
methods to estimate this quantity in chapter 7.
Statistical Methods. Springer, 1999,
While we cannot compute the probability density of the failure distribution vol. 2.
exactly, we can use its unnormalized probability density p̄(t | t 2 / y) to draw
samples from it. The unnormalized probability density is given by

/ y ) = 1{ t 6 2 y } p ( t )
p̄(t | t 2 (6.2)

Computing this density for a given trajectory only requires determining whether
it is a failure trajectory and evaluating its probability density under the nominal
trajectory distribution. The rest of this chapter discusses several methods for
sampling from this unnormalized distribution.1 With enough samples, we can
implicitly represent the distribution over failures (see figure 6.2).
Figure 6.2. Distribution over fail-
ures for the grid world prob-
6.2 Rejection Sampling lem represented implicitly through
samples. The probability of slip-
ping is set to 0.8.
Rejection sampling produces samples from a complex target distribution by ac-
cepting or rejecting samples from a different distribution that is easier to sample
from. It is inspired by the idea of throwing darts uniformly at a rectangular dart
board that encloses the graph of the density of the target distribution. If we keep
only the darts that land inside the target density, we produce samples that are
distributed according to the target distribution (see figure 6.3).
Figure 6.3. Sampling from a
In the dart board example, we are using samples from a uniform distribution truncated normal distribution by
to produce samples from an arbitrary target density. The efficiency of this process throwing darts uniformly at a rect-
angular dart board that encloses
depends on the area of the dart board that lies outside the target distribution. the graph of its density function.
If there is a large area outside the target distribution, many of the darts will The samples on the bottom are ob-
be rejected, and we will require more darts to accurately represent the target tained by moving all of the darts
that land inside the target distribu-
distribution. One way to improve efficiency is to use a different dart board that tion to the bottom of the dart board.
more closely matches the shape of the target distribution. In other words, we may These samples are distributed ac-
cording to the target distribution.
want to draw samples from a different distribution that is still easy to sample
from but more closely matches the target distribution. We call this distribution a 2
In the dart board analogy, we can
proposal distribution. think of this acceptance criteria as a
Algorithm 6.1 implements the rejection sampling algorithm given a target two step process. First, we sample
the x-coordinate of the dart from
distribution with density function p̄(t ) and a proposal distribution with density the proposal distribution. Second,
function q(t ). At each iteration, we draw a sample t from the proposal distribution we select its y-coordinate randomly
and accept it with probability proportional to p̄(t )/(cq(t )).2 To ensure that between the bottom of the board
and cq(t ). If it falls under p(t ), it
the proposal distribution fully encloses the target distribution, we require that is accepted.

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
6.2. rejection sampling 91

q(t ) > 0 whenever p̄(t ) > 0 and that c is selected such that p̄(t )  cq(t ) for all
t. The density function of the target distribution does not need to be normalized.

struct RejectionSampling Algorithm 6.1. The rejection sam-


p̄ # target density pling algorithm for sampling from
q # proposal trajectory distribution a target distribution. At each itera-
c # constant such that p(τ) ≤ cq(τ) tion, the algorithm performs a roll-
k_max # max iterations out using the proposal trajectory
end distribution, computes the accep-
tance ratio, and accepts the sample
function sample_failures(alg::RejectionSampling, sys, ψ) with probability equal to the accep-
p̄, q, c, k_max = alg.p̄, alg.q, alg.c, alg.k_max tance ratio.
τs = []
for k in 1:k_max
τ = rollout(sys, q)
if rand() < p̄(τ) / (c * pdf(q, τ))
push!(τs, τ)
end
end
return τs
end
p(t )

/ y)
p̄(t | t 2
To sample from the failure distribution, we use the unnormalized density in
equation (6.2) as the target density. A common choice for the proposal distribution 2 0 2
4 4
is the nominal trajectory distribution. To use this proposal, we must select a value
for c such that 1{t 62 y} p(t )  cp(t ). Selecting c = 1 satisfies this condition and / y)
p(t | t 2

causes the acceptance ratio to reduce to 1{t 62 y}. In other words, we will accept
a sample if it is a failure trajectory and reject it otherwise. Figure 6.4 shows an
4 2 0 2 4
example that uses the nominal trajectory distribution to sample from the failure t
distribution shown in figure 6.1.
Figure 6.4. Rejection sampling us-
If failures are unlikely under the nominal distribution, we will require many
ing the nominal trajectory distribu-
samples to produce a representative set of samples from the failure distribution. tion as the proposal distribution to
In this case, we may be able to improve efficiency by using domain knowledge to sample from the failure distribu-
tion shown in figure 6.1. The plot
select a proposal distribution that more closely matches the shape of the failure on the top shows the target den-
distribution. For example, failures occur at negative values in the simple system sity (red) and the proposal den-
sity (gray). Accepted samples are
shown in figure 6.1, so we may be able to improve efficiency by shifting the
highlighted in red. The plot on the
proposal distribution to the left. bottom shows a histogram of the
When we select the proposal distribution for rejection sampling, we must also accepted samples compared to the
density function of the failure dis-
select a value for c to ensure that the proposal distribution fully encloses the target tribution.
distribution for all t. Figure 6.5 shows an example that uses a shifted proposal
distribution to sample from the failure distribution shown in figure 6.1 for two

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
92 c h ap ter 6. failure distribution

different values of c. We want to select c to be as tight as possible to achieve the


highest efficiency. In general, selecting a good proposal distribution and value
for c requires domain knowledge and can be challenging for high-dimensional
systems with long time horizons. If c is too loose, rejection sampling may be too
inefficient to be useful (see example 6.1). The next section discusses techniques
that tend to perform better in these cases.

50 samples 200 samples 1,000 samples Figure 6.5. Using a hand-designed


proposal distribution to apply re-
p(t ) jection sampling to the simple sys-
Mq(t ) tem in figure 6.1 for two different
/ y)
p̄(t | t 2
values of M. The proposal distribu-
tion is a normal distribution shifted
to the left (q(t ) = N t | 1, 12 ).
c=1

4 2 0 2 4 4 2 0 2 4 4 2 0 2 4
The top row shows the results for
/ y)
p(t | t 2
M = 1, which is a loose bound. The
bottom row shows the results for
M = 0.6065, which is the tightest
possible value for M. More sam-
4 2 0 2 4 4 2 0 2 4 4 2 0 2 4 ples are accepted using the tighter
value for M resulting in greater ef-
t t t
ficiency.

50 samples 200 samples 1,000 samples

p(t )
Mq(t ) / y)
p̄(t | t 2
c = 0.6065

4 2 0 2 4 4 2 0 2 4 4 2 0 2 4

/ y)
p(t | t 2

4 2 0 2 4 4 2 0 2 4 4 2 0 2 4
t t t

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
6.2. rejection sampling 93

Suppose we want to use rejection sampling to sample failures from an in- Example 6.1. Example of the chal-
lenges of using rejection sampling
verted pendulum system where the standard deviation of the sensor noise for high-dimensional systems with
for each state variable is 0.1. From example 4.3, we know that failures are rare long time horizons. In this exam-
ple, we compute the tightest value
under the nominal trajectory distribution, so rejection sampling using the we can select for c based on domain
nominal trajectory distribution as a proposal will be inefficient. We also saw knowledge for the inverted pendu-
in example 4.3 that when we instead sampled trajectories from a distribution lum system and show that it is pro-
hibitively large.
where the standard deviation of the sensor noise was 0.15, we were able to
find failures. Therefore, we may want to use this distribution as a proposal
for rejection sampling.
We must then select a value for c such that

p(t )  cq(t )
d d
p(s1 ) ’ N ( xt | 0, (0.1)2 I )  cp(s1 ) ’ N ( xt | 0, (0.15)2 I )
t =1 t =1
d ✓ ◆
N ( xt | 0, 0.01I )
’ N ( xt | 0, 0.0225I )
c
t =1

where we assume that the initial state distribution is the same for the proposal
and target. The term in the product will be maximized when xt = [0, 0] for
all t. Plugging this result into the product and assuming a depth of 40, we
find that ✓ ◆
N (0 | 0, 0.01I ) 40
c
N (0 | 0, 0.0225I )
1.2226 ⇥ 1014  c

Therefore, the tightest value we can select for c is 1.2226 ⇥ 1014 . Using this
value, our acceptance probabilities end up being very small (on the order of
10 15 ), and rejection sampling is inefficient.

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
94 c h ap ter 6. fai lure distribution

6.3 Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) algorithms generate samples from a target
distribution by sampling from a Markov chain.3 A Markov chain is a sequence of 3
A detailed overview of MCMC
techniques is provided in C. P.
random variables where each variable depends only on the previous one. MCMC
Robert and G. Casella, Monte Carlo
algorithms begin by initializing a Markov chain with an initial sample t. At each Statistical Methods. Springer, 1999,
iteration, they use the current sample t to generate a new sample t 0 by sampling vol. 2.

from a conditional distribution g(· | t ). This distribution is sometimes referred to


as a kernel.4 We accept or reject the new sample based on an acceptance criteria. If 4
This distribution is also some-
the new sample is accepted, we set t = t 0 and continue to the next iteration. If times referred to as a proposal dis-
tribution. It differs from the pro-
the new sample is rejected, we keep the previous sample. posal distribution used in rejection
Given certain properties of the kernel and acceptance criterion, MCMC algo- sampling in that it is conditioned
on the previous sample.
rithms are guaranteed to converge to the target distribution in the limit of infinite
samples. However, the initial samples may not be representative of the target
distribution. For this reason, we often specify a burn-in period in which the initial
samples are discarded. Furthermore, unlike rejection sampling, the samples pro-
duced by MCMC algorithms are not independent from one another. Each sample
in the chain depends on the previous one. Therefore, it is also common to thin
the samples by only keeping every hth sample. Several variations of MCMC differ
in how they implement the acceptance criteria and the kernel.

6.3.1 Metropolis-Hastings
One of the most common MCMC algorithms is the Metropolis-Hastings algorithm.5 5
W. K. Hastings, “Monte Carlo
The Metropolis-Hastings algorithm accepts a new sample t 0 given the current Sampling Methods Using Markov
Chains and Their Applications,”
sample t with probability Biometrika, vol. 57, no. 1, pp. 97–97,
p̄(t 0 ) g(t | t 0 ) 1970.
(6.3)
p̄(t ) g(t 0 | t ) 6
When the kernel is symmet-
where p̄ is the unnormalized target density. To sample from the failure distribution, ric, the algorithm is called the
Metropolis algorithm: N. Metropo-
we set p̄ = 1{t 62 y} p(t ). Since we are taking a ratio of the densities, the target lis, A. W. Rosenbluth, M. N. Rosen-
density does not need to be normalized. The kernel g(· | t ) is often chosen to bluth, A. H. Teller, and E. Teller,
“Equation of State Calculations by
be a symmetric distribution, meaning that g(t 0 | t ) = g(t | t 0 ).6 In this case, Fast Computing Machines,” Journal
the acceptance criteria reduces to p̄(t 0 )/ p̄(t ). Intuitively, if t 0 is more likely than of Chemical Physics, vol. 21, no. 6,
t, it is always accepted. If t 0 is less likely than t, it is accepted with probability pp. 1087–1092, 1953. A common
choice of a symmetric kernel is a
proportional to the ratio of the densities. Gaussian distribution centered at
the previous sample.

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
6.3. markov chain monte carlo 95

Algorithm 6.2 implements the Metropolis-Hastings algorithm given a target


density, a kernel, and an initial trajectory to begin the Markov chain. The kernel
is a conditional distribution that takes in a trajectory and produces a trajectory
distribution. Example 6.2 shows an example of a kernel for the inverted pendulum
system. The next sample is generated by performing a rollout using this distri-
bution. We then accept or reject the new sample based on the acceptance ratio
in equation (6.3). Figure 6.6 shows the result of using the Metropolis-Hastings
algorithm to sample from the failure distribution shown in figure 6.1.

struct MCMCSampling Algorithm 6.2. The Metropolis-


p̄ # target density Hastings algorithm for sampling
g # kernel: τ′ = rollout(sys, g(τ)) from a target distribution. The ker-
τ # initial trajectory nel function g must take in a trajec-
k_max # max iterations tory and return a trajectory distri-
m_burnin # number of samples to discard from burn-in bution. At each iteration, the algo-
m_skip # number of samples to skip for thinning rithm generates a new sample by
end performing a rollout using this dis-
tribution. It then accepts or rejects
function sample_failures(alg::MCMCSampling, sys, ψ) the new sample based on the accep-
p̄, g, τ = alg.p̄, alg.g, alg.τ tance ratio in equation (6.3). The al-
k_max, m_burnin, m_skip = alg.k_max, alg.m_burnin, alg.m_skip gorithm discards the first m_burnin
τs = [] samples and thins the remaining
for k in 1:k_max samples according to m_skip.
τ′ = rollout(sys, g(τ))
if rand() < (p̄(τ′) * pdf(g(τ′), τ)) / (p̄(τ) * pdf(g(τ), τ′))
τ = τ′
end
push!(τs, τ)
end
return τs[m_burnin:m_skip:end]
end

6.3.2 Smoothing
When we use algorithm 6.2 to sample from the failure distribution, we will not
accept any samples that are not failures because p̄(t ) = 1{t 62 y} p(t ) will be 0
for those samples. While this behavior is necessary for the algorithm to converge
to the failure distribution in the limit of infinite samples, it can create challenges
in practice. For example, if we initialize the Markov chain to a safe trajectory, the
algorithm will reject all samples from g(· | t ) until it samples a failure. Since
g(· | t ) typically produces trajectories similar to t, we may require many samples

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
96 c h ap ter 6. fai lure distribution

To define a Gaussian kernel for the inverted pendulum system, we must first Example 6.2. Example of a Gaus-
sian kernel for the inverted pendu-
define a trajectory distribution type (algorithm 4.3) for the pendulum. The lum system.
following code defines a trajectory distribution for the pendulum system
that uses a Gaussian distribution for the initial state and a vector of Gaussian
distributions for the observation disturbance distributions:
struct PendulumTrajectoryDistribution <: TrajectoryDistribution
μ₁ # mean of initial state distribution
Σ₁ # covariance of initial state distribution
μs # vector of means of length d
Σs # vector of covariances of length d
end
function initial_state_distribution(p::PendulumTrajectoryDistribution)
return MvNormal(p.μ₁, p.Σ₁)
end
function disturbance_distribution(p::PendulumTrajectoryDistribution, t)
D = DisturbanceDistribution((o)->Deterministic(),
(s,a)->Deterministic(),
(s)->MvNormal(p.μs[t], p.Σs[t]))
return D
end
depth(p::PendulumTrajectoryDistribution) = length(p.μs)

We can then define a kernel for the pendulum system that returns an instan-
tiation of this distribution as follows:
function inverted_pendulum_kernel(τ; Σ=0.01I)
μ₁ = τ[1].s
μs = [step.x.xo for step in τ]
return PendulumTrajectoryDistribution(μ₁, Σ, μs, [Σ for step in τ])
end

The new distribution is centered at the initial state and observation distur-
bances of the current sample. We can use this kernel with algorithm 6.2 to
sample from the failure distribution of the inverted pendulum system.

2 Figure 6.6. Metropolis-Hastings


Burn-in applied to sample from the fail-
/ y)
p(t | t 2 ure distribution shown in figure 6.1.
0 We use a Gaussian kernel with a
standard deviation of 1. The plot
t

on the left shows the samples over


2
time. The plot on the right shows a
Failure histogram of the resulting samples
Region
4 compared to the true probability
0 200 400 600 800 1,000 4 2 0 2 4 density function of the failure dis-
Iteration t tribution.

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
6.3. markov chain monte carlo 97

before we sample a failure to accept, especially if t is far from the failure region.7 7
One way to avoid this behavior is
We see this behavior during the burn-in period in figure 6.6. to ensure that the initial trajectory
is a failure. The algorithms in chap-
Another challenge arises when the failure distribution has multiple modes. To ters 4 and 5 can be used to search
move between modes, the algorithm must sample a failure from one failure mode for an initial failure trajectory.
using a kernel conditioned on a trajectory from another. If the failure modes are
spread out in the trajectory space, the algorithm may require a large number of
samples before moving from one mode to another. Example 6.3 illustrates these
challenges on a simple Gaussian system.
Smoothing is a technique that addresses these challenges by modifying the target
density to make it easier to sample from.8 It relies on a notion of the distance 8
H. Delecki, A. Corso, and M. J.
to failure, which we will write as D(t ) for a given trajectory t. This distance Kochenderfer, “Model-Based Vali-
dation as Probabilistic Inference,”
is a nonnegative number that measures how close t is to a failure. For failure in Conference on Learning for Dynam-
trajectories, D(t ) should be 0. We can rewrite the target density in terms of this ics and Control (L4DC), 2023.
distance as
/ y ) = 1{ D ( t )  0} p ( t )
p̄(t | t 2 (6.4)
The indicator function causes sharp boundaries between safe and unsafe trajecto-
ries. To create a smooth version of this density, we replace the indicator function
with a Gaussian distribution with mean 0 and a small standard deviation. The
resulting smoothed density is
e = 0.8
⇣ ⌘ e = 0.5
/ y) ⇡ N D(t ) | 0, e2 p(t )
p̄(t | t 2 (6.5) e = 0.2
no smoothing
where e is the standard deviation.
For systems with temporal logic specifications, we can specify the distance 4 2 0 2 4
function using temporal logic robustness (section 3.5.2). Since robustness is t
positive when the formula is satisfied and negative when it is violated, we can
Figure 6.7. Smoothed versions
write the distance function as of the failure distribution in fig-
ure 6.1 for different values of e. As
D(t ) = max(0, r(t )) (6.6) e decreases, the smoothed distribu-
tion approaches the failure distri-
bution.
where r(t ) is the robustness of the trajectory t. Figure 6.7 shows the smoothed
version of the failure distribution in figure 6.1 for different values of e. As e
approaches 0, the smoothed density approaches the shape of failure distribution.
As e approaches infinity, the smoothed density approaches the shape of the
nominal distribution.

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
98 c hap ter 6. fai lure distribution

Suppose we want to sample from the failure distribution shown in the plot Example 6.3. Example of the chal-
lenges of using MCMC to sample
on the left and we initialize our Markov chain with t = 1. We will not accept from the failure distribution given
a new sample until we draw a sample with a value less than 1. If we use a a finite sample budget. The plot
on the left demonstrates the chal-
Gaussian kernel with standard deviation 1, we have that lenges with initialization, and the
⇣ ⌘ plot on the right shows the chal-
g(t 0 | t ) = N t 0 | 1, 12 lenges of sampling from failure dis-
tributions with multiple modes.
The probability of drawing a sample less than 1 from this distribution is
0.02275 (corresponding to the shaded region in the plot on the left). Therefore,
we will require 44 samples on average before the algorithm accepts a sample.
If we were to initialize the algorithm with a sample even further from the
failure region, we would require even more samples to the point where
MCMC may not converge within a finite sample budget.
The plot on the right demonstrates the challenge of using MCMC to sample
from a failure distribution with multiple modes. In this case, the current
sample is in the mode on the left at 2.2. Using the same Gaussian kernel,
we have ⇣ ⌘
g(t 0 | t ) = N t 0 | 2.2, 12

The probability of moving to the other mode from this point is 1.3346 ⇥ 10 5 .
Therefore, we will require a large number of samples before we switch modes.

g(t 0 | t )
g(t 0 | t )
/ y)
p̄(t | t 2
/ y)
p̄(t | t 2

t t
4 2 0 2 4 4 2 0 2 4
t t

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
6.3. markov chain monte carlo 99

The smoothed failure distribution assigns a nonzero probability to all trajecto-


ries, and it assigns higher probabilities to trajectories that are close to failure. This
design allows the MCMC algorithm to more easily move between failure modes.
However, because the smoothed distribution will assign a nonzero probability to
safe trajectories, the algorithm will accept some samples that are not failures. We
can still recover the failure distribution by rejecting these samples after MCMC
has terminated. In fact, this process is equivalent to performing rejection sam-
pling with the smoothed density as the proposal distribution. Figure 6.8 and
example 6.4 show the benefit of applying MCMC with a smoothed density to
sample from a failure distributions with multiple modes.

6.3.3 Metropolis-Adjusted Langevin Algorithm


The performance of MCMC is sensitive to the choice of kernel. While a Gaussian
kernel is simple to implement, it does not scale well to high-dimensional systems
with complex failure distributions because it randomly explores the target density
without taking into account its underlying structure. We can improve performance
by selecting a kernel that takes into account this structure. For example, we can
use knowledge of the gradient of the target density to guide exploration.
The Metropolis-Adjusted Langevin Algorithm (MALA) uses a gradient-based kernel that approximates a process known as Langevin diffusion.⁹ The kernel is defined as

    g(τ′ | τ) = N(τ′ | τ + α∇ log p̄(τ), (√(2α))²)    (6.7)

where α is a hyperparameter of the algorithm.¹⁰ The MALA kernel is not symmetric in general. Intuitively, the kernel takes a step in the direction of the greatest increase in log likelihood and samples from a Gaussian distribution centered at the new location. The algorithm then accepts or rejects the new sample based on the Metropolis-Hastings acceptance ratio in equation (6.3). We can run MALA using algorithm 6.2 by implementing a kernel that follows equation (6.7).
Using the gradient to guide the sampling allows the algorithm to explore the target density more efficiently than a random walk. Furthermore, when combined with the smoothing technique in section 6.3.2, the gradient helps to guide the algorithm toward different failure modes. Figure 6.9 compares the path taken by a Gaussian kernel with the path taken by MALA on a simple target density.

⁹ Langevin dynamics is an idea from physics that was developed to model molecular systems by physicist Paul Langevin (1872–1946). MALA is also referred to as Langevin Monte Carlo. U. Grenander and M. I. Miller, "Representations of Knowledge in Complex Systems," Journal of the Royal Statistical Society: Series B (Methodological), vol. 56, no. 4, pp. 549–581, 1994.

¹⁰ This kernel represents a discrete approximation of the Langevin diffusion process. It approaches the continuous-time Langevin diffusion process as α approaches 0.
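To make equation (6.7) concrete, here is a minimal sketch of a single MALA step in Julia. It assumes a trajectory is represented as a plain vector and that functions logp̄ and ∇logp̄ for the log target density and its gradient are available (for example, via an autodifferentiation package); these names and the step size α are illustrative rather than part of algorithm 6.2.

    using Distributions, LinearAlgebra

    function mala_step(τ, logp̄, ∇logp̄; α=0.01)
        # proposal g(τ′ | τ) from equation (6.7)
        g(τ₀) = MvNormal(τ₀ .+ α .* ∇logp̄(τ₀), Diagonal(fill(2α, length(τ₀))))
        τ′ = rand(g(τ))
        # Metropolis-Hastings acceptance ratio for a nonsymmetric kernel
        logr = logp̄(τ′) - logp̄(τ) + logpdf(g(τ′), τ) - logpdf(g(τ), τ′)
        return log(rand()) < logr ? τ′ : τ
    end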


Figure 6.8. The effect of smoothing on the MCMC algorithm for a failure distribution with multiple modes, with columns showing the results without smoothing, with ϵ = 0.3, and with ϵ = 0.5. The first row shows the target density (gray) compared to the density of the true failure distribution (red). The second row shows the MCMC samples over time with the failure regions shaded in red. The third row shows the accepted (red) and rejected (gray) samples. The fourth row shows a histogram of the accepted samples compared to the true probability density function of the failure distribution. Without smoothing, MCMC stays in the same failure mode for all 2,000 iterations and misses the other mode. Applying smoothing allows the algorithm to more easily move between failure modes and results in better estimates of the failure distribution given the sample budget.


Example 6.4. Applying smoothing to sample from the failure distribution of the inverted pendulum system. Smoothing allows MCMC to sample from both failure modes given a finite sample budget.

The inverted pendulum system has two main failure modes: tipping over in the negative direction and tipping over in the positive direction. To observe the effect of smoothing on the performance of MCMC, we define the following two unnormalized target densities:

    p = NominalTrajectoryDistribution(inverted_pendulum, 21) # depth = 21
    p̄(τ) = isfailure(ψ, τ) * pdf(p, τ)
    function p̄_smooth(τ; ϵ=0.15)
        Δ = max(robustness([step.s for step in τ], ψ.formula), 0)
        return pdf(Normal(0, ϵ), Δ) * pdf(p, τ)
    end

The plot in the margin, which shows the pendulum angle θ over time, presents the results when we run algorithm 6.2 using p̄ as the target density: we do not accept any samples that are not failures, and we only observe failures from one failure mode. The plots below show the results when we use p̄_smooth combined with rejection sampling. Smoothing allows us to sample failures from both failure modes. However, we now draw some samples that are not failures during the MCMC (left), so we must reject them after the algorithm has terminated to recover the failure distribution (right).


Figure 6.9. Comparison of the paths taken by algorithm 6.2 using the Gaussian kernel (left) and the MALA kernel (right) for a two-dimensional smoothed target density with a failure region shown in red. Brighter contours indicate higher density. The MALA kernel uses the gradient of the log likelihood to guide its steps and requires fewer samples than the Gaussian kernel to move to the failure region. The MALA kernel also has a higher acceptance rate.

The MALA kernel enables MCMC to move more efficiently toward regions of high likelihood.

6.3.4 Metropolis-Hastings Variations

MALA is one of several variations of the Metropolis-Hastings algorithm. Other gradient-based variations include Hamiltonian Monte Carlo¹¹ (HMC) and the No U-Turn Sampler¹² (NUTS). HMC uses a simulation of Hamiltonian dynamics based on the gradient of the log likelihood to guide exploration. NUTS is an extension of HMC that is less sensitive to hyperparameters. Another variation of the Metropolis-Hastings algorithm is Gibbs sampling, which updates each variable in the target density one at a time conditioned on the values of the other variables.¹³ Gibbs sampling is particularly beneficial when sampling from high-dimensional target densities where the conditional distributions are easier to sample from than the joint distribution.

¹¹ Hamiltonian Monte Carlo is also referred to as Hybrid Monte Carlo. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, "Hybrid Monte Carlo," Physics Letters B, vol. 195, no. 2, pp. 216–222, 1987.

¹² M. D. Hoffman and A. Gelman, "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo," Journal of Machine Learning Research (JMLR), vol. 15, no. 1, pp. 1593–1623, 2014.

¹³ G. Casella and E. I. George, "Explaining the Gibbs Sampler," The American Statistician, vol. 46, no. 3, pp. 167–174, 1992.
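As a toy illustration of these conditional updates, the following sketch runs Gibbs sampling on a standard bivariate Gaussian target with correlation ρ, where each conditional distribution is itself Gaussian; the target here is illustrative and unrelated to any particular system in this book.

    using Distributions

    function gibbs_bivariate(k_max; ρ=0.8)
        x, y = 0.0, 0.0
        samples = Vector{Tuple{Float64,Float64}}(undef, k_max)
        for k in 1:k_max
            x = rand(Normal(ρ*y, sqrt(1 - ρ^2))) # draw x conditioned on y
            y = rand(Normal(ρ*x, sqrt(1 - ρ^2))) # draw y conditioned on x
            samples[k] = (x, y)
        end
        return samples
    end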
6.4 Probabilistic Programming

Probabilistic programming is a technique for specifying probabilistic models as computer programs in a way that allows inference to be performed automatically.¹⁴ By specifying the model of a given system as a probabilistic program, we can apply a variety of MCMC algorithms to sample from the failure distribution.

¹⁴ An in-depth overview of probabilistic programming is provided in G. Barthe, J.-P. Katoen, and A. Silva, Foundations of Probabilistic Programming. Cambridge University Press, 2020.


Furthermore, probabilistic programming tools are often combined with autodifferentiation tools, which allow us to automatically compute the gradient of the target density for use in gradient-based MCMC algorithms.¹⁵ These features allow us to sample from the failure distribution of complex systems without the need for significant manual overhead.

¹⁵ A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed. SIAM, 2008.

Algorithm 6.3. Probabilistic programming algorithm for sampling from the failure distribution. The algorithm uses the Turing.jl package to specify the rollout function as a probabilistic model. It adds a log probability term equivalent to the smoothed indicator function in equation (6.5) to specify that we want to sample failure trajectories. It then generates samples using the specified MCMC algorithm. Turing.jl supports a variety of MCMC algorithms, including Metropolis-Hastings, HMC, NUTS, and Gibbs sampling.

    struct ProbabilisticProgramming
        Δ        # distance function: Δ(𝐬)
        mcmc_alg # e.g. Turing.NUTS()
        k_max    # number of samples
        d        # trajectory depth
        ϵ        # smoothing parameter
    end

    function sample_failures(alg::ProbabilisticProgramming, sys, ψ)
        Δ, mcmc_alg = alg.Δ, alg.mcmc_alg
        k_max, d, ϵ = alg.k_max, alg.d, alg.ϵ
        @model function rollout(sys, d; xo=fill(missing, d),
                                        xa=fill(missing, d),
                                        xs=fill(missing, d))
            p = NominalTrajectoryDistribution(sys, d)
            s ~ initial_state_distribution(p)
            𝐬 = [s, [zeros(length(s)) for i in 1:d]...]
            for t in 1:d
                D = disturbance_distribution(p, t)
                s = 𝐬[t]
                xo[t] ~ D.Do(s)
                o = sys.sensor(s, xo[t])
                xa[t] ~ D.Da(o)
                a = sys.agent(o, xa[t])
                xs[t] ~ D.Ds(s, a)
                𝐬[t+1] = sys.env(s, a, xs[t])
            end
            Turing.@addlogprob! logpdf(Normal(0.0, ϵ), Δ(𝐬))
        end
        return Turing.sample(rollout(sys, d), mcmc_alg, k_max)
    end

Algorithm 6.3 writes the rollout function as a probabilistic program that can be used to sample from the smoothed failure distribution.¹⁶ Similar to algorithm 4.5, the probabilistic programming model samples an initial state from the initial state distribution and steps the system forward in time by sampling from the disturbance distribution at each time step. However, rather than explicitly drawing the samples, the model only specifies the distributions from which the samples are drawn. The probabilistic programming tool handles the sampling and keeps track of the probability associated with each draw automatically.

¹⁶ We use a probabilistic programming package written for the Julia language called Turing.jl.
To specify that we want to sample failure trajectories, we add a log proba-
bility term for the smoothed indicator function in equation (6.5). Probabilistic
programming tools often perform operations in log space for numerical stability.
Adding this term in log space is equivalent to multiplying the target density by
the smoothed indicator function. Example 6.5 demonstrates how to use algo-
rithm 6.3 to sample from the failure distribution of the inverted pendulum system.
It runs the algorithm twice to produce two chains that capture two distinct failure
modes. In addition to smoothing, running multiple MCMC chains from different
starting points is another method to improve performance of MCMC for failure
distributions with multiple modes.
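For reference, a hypothetical invocation of algorithm 6.3 might look as follows, reusing the trajectory depth and smoothing parameter from example 6.4 and assuming a distance function Δ and an MCMC algorithm mcmc_alg constructed as in example 6.5; the sample budget of 1,000 is arbitrary:

    alg = ProbabilisticProgramming(Δ, mcmc_alg, 1_000, 21, 0.15)
    chain = sample_failures(alg, inverted_pendulum, ψ)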

Example 6.5. Sampling from the failure distribution of the inverted pendulum system using algorithm 6.3. The plot shows the result of running the algorithm twice to produce two MCMC chains. The initial samples that are not failures are discarded during the burn-in period.

To use probabilistic programming to sample from the failure distribution of the inverted pendulum system, we can use the following code to set up the MCMC algorithm and distance function:

    mcmc_alg = Turing.NUTS(10, 0.65, max_depth=6)
    Δ(𝐬) = max(robustness(𝐬, ψ.formula, w=1.0), 0)

The code sets up the No U-Turn Sampler (NUTS) MCMC algorithm. Since NUTS relies on the gradient of the target density, we use smoothed robustness in the distance function so that the gradient exists. The first two parameters in the NUTS constructor are the number of adaptation steps and the target acceptance rate. The plot, which shows the pendulum angle θ over time, presents the result of running algorithm 6.3 with the specified parameters. We run the algorithm twice to produce two MCMC chains. Running multiple chains from different starting points is another method to improve performance for failure distributions with multiple modes.

6.5 Summary

• In general, it is difficult to compute the distribution over failures exactly, but we can compute its unnormalized density.

• Using an unnormalized density over failures, we can apply a variety of algorithms to draw samples.

• Rejection sampling works by drawing independent samples from a proposal distribution and accepting them with probability proportional to the ratio of the target density to the proposal density.

• The performance of rejection sampling depends on the choice of proposal distribution, and it can be difficult to select a good proposal distribution for high-dimensional systems.

• Markov chain Monte Carlo (MCMC) algorithms sample from the target distribution by drawing samples from a Markov chain and scale well to high-dimensional systems.

• MCMC is only guaranteed to converge to the target distribution in the limit of infinite samples, but we cannot generate an infinite number of samples in practice.

• We can use heuristics such as smoothing and gradient-based kernels to improve the performance of MCMC with a finite sample budget.

• Probabilistic programming is a tool that allows us to specify probabilistic models as computer programs and can be used to sample from the failure distribution of complex systems.

7 Failure Probability Estimation

After searching for the potential failure modes of a system, we may also want to
estimate its probability of failure. This chapter presents several techniques for
estimating this quantity from samples. We begin by discussing a direct estimation
approach that uses samples from the nominal trajectory distribution to estimate
the probability of failure. If failures are rare, this approach may be inefficient and
require a large number of samples to produce a good estimate. The remainder of
the chapter discusses more efficient estimation techniques based on importance
sampling. Importance sampling techniques artificially increase the likelihood of
failure trajectories by sampling from a proposal distribution. We discuss several
variations of importance sampling and conclude by presenting a nonparametric
algorithm that estimates the probability of failure from a sequence of samples.

7.1 Direct Estimation

The probability of failure for a given system and specification is defined mathematically as

    pfail = E_{τ∼p(·)}[1{τ ∉ ψ}] = ∫ 1{τ ∉ ψ} p(τ) dτ    (7.1)

where 1{·} is the indicator function. The expectation is taken over the nominal trajectory distribution for the system.¹ Given a set of m trajectories from this distribution, we can produce an estimate p̂fail of the probability of failure by treating the problem as a parameter learning problem, where the parameter of interest is the parameter of a Bernoulli distribution. We can then apply the maximum likelihood or Bayesian methods from chapter 2 to calculate p̂fail.

¹ Note that the right-hand side of equation (7.1) is equivalent to the denominator in equation (6.1). In other words, the probability of failure is the normalizing constant for the failure distribution.

7.1.1 Maximum Likelihood Estimate

The maximum likelihood estimate of the probability of failure is

    p̂fail = (1/m) Σ_{i=1}^m 1{τi ∉ ψ} = n/m    (7.2)

where n is the number of samples that resulted in a failure and m is the total number of samples. Algorithm 7.1 uses direct sampling to implement this estimator. It performs m rollouts and computes the probability of failure according to equation (7.2).

Algorithm 7.1. The direct estimation algorithm for estimating the probability of failure. The algorithm performs rollouts to a depth d to generate m samples from the nominal trajectory distribution. It then applies equation (7.2) to compute p̂fail and returns the result.

    struct DirectEstimation
        d # depth
        m # number of samples
    end

    function estimate(alg::DirectEstimation, sys, ψ)
        d, m = alg.d, alg.m
        τs = [rollout(sys, d=d) for i in 1:m]
        return mean(isfailure(ψ, τ) for τ in τs)
    end

We can evaluate the accuracy of an estimator using metrics such as bias, consistency, and variance (example 7.1). Equation (7.2) provides an empirical estimate of the probability of failure by computing the sample mean of a set of samples drawn from a Bernoulli distribution with parameter pfail. The sample mean is an unbiased estimator of the true mean of a Bernoulli distribution, so the estimator is unbiased. We can calculate the variance of this estimator by dividing the variance of a Bernoulli distribution by the number of samples:

    Var[p̂fail] = pfail(1 − pfail) / m    (7.3)

The square root of this quantity is known as the standard error of the estimator. A lower variance means that the sample mean will be closer to the true mean on average and therefore indicates a more accurate estimator. In the limit of infinite samples, the variance approaches zero, so the estimator is consistent.
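As a small illustration, we can approximate the standard error by substituting p̂fail for the unknown pfail in equation (7.3). The following sketch assumes sys, ψ, d, and m are defined as in algorithm 7.1:

    p̂ = mean(isfailure(ψ, rollout(sys, d=d)) for i in 1:m)
    se = sqrt(p̂ * (1 - p̂) / m) # approximate standard error from equation (7.3)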


Example 7.1. Common metrics used to evaluate estimators. The plots show the predictions of three different estimators as the number of samples m grows, with shaded regions representing the variance.

Bias, consistency, and variance are three common properties used to evaluate the quality of an estimator. An estimator that produces p̂fail is unbiased if it predicts the true value in expectation:

    E[p̂fail] = pfail

An estimator is consistent if it converges to the true value in the limit of infinite samples:

    lim_{m→∞} p̂fail = pfail

For example, given a set of samples drawn independently from the same distribution, the sample mean is an unbiased and consistent estimator of the distribution's true mean. The variance of the estimator quantifies the spread of the estimates around the true value. For the sample mean example, the variance will decrease as the number of samples increases. The plots illustrate these concepts with one estimator that is consistent but biased, one that is unbiased but inconsistent, and one that is both unbiased and consistent. The shaded regions reflect the variance of the estimator.
In general, we want to use an estimator that is unbiased, consistent, and has low variance. However, we are sometimes forced to trade off between these metrics to achieve the best efficiency for complex problems.


Equation (7.3) provides insight into the accuracy of the estimator in equation (7.2). However, it is expressed in terms of the true probability of failure pfail, which is the quantity we want to estimate. Therefore, we cannot apply this equation to directly assess the accuracy of the output of algorithm 7.1. We can instead use it to reason about qualitative trends. For example, equation (7.3) indicates that we decrease the variance of our estimator by collecting more samples. Example 7.2 illustrates this trend on the grid world problem.

Example 7.2. The empirical mean and variance of the direct estimator for the grid world problem computed over 10 trials of algorithm 7.1. The depth d is set to 50 and the probability of slipping is set to 0.8. The blue line shows the mean of p̂fail for all 10 trials, and the shaded region represents one standard deviation above and below the mean.

We demonstrate the effect of equation (7.3) empirically by running 10 trials of algorithm 7.1 on the grid world problem. We compute the empirical mean and variance of p̂fail across all 10 trials after each new sample. The plot shows the results of this experiment. As predicted by equation (7.3), the variance decreases as the number of samples m increases.

In addition to the number of samples, the true probability of failure pfail also
has an impact on the relative accuracy of the estimator. As the true probability
of failure decreases, the number of samples required to achieve a given level of
accuracy increases (see exercise 7.1). For systems in which failure events are rare,
we may require a large number of samples to produce an accurate estimate for
the probability of failure using algorithm 7.1. Section 7.2 introduces importance
sampling, which can be used to improve the efficiency in these scenarios.


7.1.2 Bayesian Estimate

Bayesian failure probability estimation may improve accuracy in scenarios with limited data or rare failure events. For example, suppose we want to estimate the probability of an aircraft collision from an aviation safety database that contains flight records from the past week. If there are no recorded midair collisions for the past week, the maximum likelihood estimate for the probability of a midair collision would be zero. Believing that there is zero chance of a midair collision is not a reasonable conclusion unless our prior hypothesis was, for example, that all flights were perfectly safe.
Bayesian estimation techniques incorporate a prior belief about the safety of the system and maintain a full distribution over the probability of failure. Since p̂fail is the parameter of a Bernoulli distribution, the distribution over the probability of failure is a beta distribution. The posterior distribution after observing n failures in m samples is

    pfail ∼ Beta(α + n, β + m − n)    (7.4)

if we start with a prior of Beta(α, β).
Algorithm 7.2 implements this estimator given a prior distribution. It performs m rollouts and computes the posterior distribution over the probability of failure according to equation (7.4). The prior distribution should be selected to reflect our prior beliefs about the probability of failure based on domain knowledge. If we do not have any reason to believe that one value of pfail is more probable than another value in the absence of data, we can use a uniform prior of Beta(1, 1). Figure 7.1 shows an example of how the posterior distribution changes as more samples are collected. We can convert the distribution over the probability of failure into a point estimate by computing its mean or mode. The mean of the distribution Beta(α, β) is

    α / (α + β)    (7.5)

and the mode is

    (α − 1) / (α + β − 2)    (7.6)

assuming α and β are greater than 1.
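For instance, with a uniform Beta(1, 1) prior and n observed failures out of m samples, these point estimates can be read directly off the posterior using Distributions.jl; this sketch assumes n and m are already defined:

    using Distributions
    posterior = Beta(1 + n, 1 + m - n) # posterior under a Beta(1, 1) prior
    point_mean = mean(posterior)       # equation (7.5): α / (α + β)
    point_mode = mode(posterior)       # equation (7.6): (α − 1) / (α + β − 2)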
Maintaining a distribution over the probability of failure allows us to explicitly quantify the uncertainty in our estimate. For example, suppose we have a target level of safety corresponding to a probability of failure less than or equal to δ.


Algorithm 7.2. The Bayesian estimation algorithm for estimating a distribution over the probability of failure. The algorithm performs rollouts to a depth d to generate m samples from the nominal trajectory distribution. Using the prior, it then applies equation (7.4) to compute the posterior distribution over the probability of failure and returns the result.

    struct BayesianEstimation
        prior::Beta # from Distributions.jl
        d           # depth
        m           # number of samples
    end

    function estimate(alg::BayesianEstimation, sys, ψ)
        prior, d, m = alg.prior, alg.d, alg.m
        τs = [rollout(sys, d=d) for i in 1:m]
        n, m = sum(isfailure(ψ, τ) for τ in τs), length(τs)
        return Beta(prior.α + n, prior.β + m - n)
    end

Figure 7.1. Bayesian estimation applied to the grid world problem with a probability of slipping set to 0.5. We begin with a uniform prior of Beta(1, 1) and determine a distribution over the probability of failure by applying algorithm 7.2 with 1, 5, 10, and 100 samples (observing 0, 1, 1, and 15 failures, respectively). As we observe more samples, the distribution over the probability of failure becomes more concentrated around a small range of probabilities.


We are interested in the quantity P(pfail ≤ δ), which is the probability that the true probability of failure is less than or equal to δ. This quantity is given by the cumulative distribution function of the posterior distribution over the probability of failure.² The quantiles of the posterior distribution can be used to compute confidence intervals in a similar manner. Example 7.3 demonstrates this process.

² The cumulative distribution function of a Beta distribution is the regularized incomplete beta function. Software packages such as Distributions.jl provide implementations of both the cumulative distribution function and the quantile function for the Beta distribution.

7.2 Importance Sampling

Importance sampling algorithms increase the efficiency of sampling-based estimation techniques. Instead of sampling from the nominal trajectory distribution p, they sample from a proposal distribution q that assigns higher likelihood to areas of greater "importance."³ To estimate the probability of failure using these samples, we must transform the expectation in equation (7.1) to an expectation over q:

    pfail = E_{τ∼p(·)}[1{τ ∉ ψ}]
          = ∫ p(τ) 1{τ ∉ ψ} dτ
          = ∫ p(τ) (q(τ)/q(τ)) 1{τ ∉ ψ} dτ
          = ∫ q(τ) (p(τ)/q(τ)) 1{τ ∉ ψ} dτ
          = E_{τ∼q(·)}[(p(τ)/q(τ)) 1{τ ∉ ψ}]    (7.7)

³ This proposal distribution has similar properties to the proposal distribution introduced in section 6.2 for rejection sampling.
For equation (7.7) to be valid, we require that q(τ) > 0 wherever p(τ)1{τ ∉ ψ} > 0. This condition is satisfied as long as the proposal distribution assigns a nonzero likelihood to all failure trajectories that are possible under p.
Given samples from q(·), we can estimate the probability of failure based on equation (7.7) as

    p̂fail = (1/m) Σ_{i=1}^m (p(τi)/q(τi)) 1{τi ∉ ψ}    (7.8)

Algorithm 7.3 implements this estimator. Equation (7.8) is an unbiased estimator of the true probability of failure. It corresponds to a weighted average of samples from the proposal distribution:

    p̂fail = (1/m) Σ_{i=1}^m wi 1{τi ∉ ψ}    (7.9)


Example 7.3. Quantifying uncertainty in the probability estimate produced by algorithm 7.2. The plots show the posterior distribution Beta(1, 101). The shaded region in the first plot represents the probability that the true probability of failure is less than or equal to 0.01. The shaded region in the second plot shows the 95 % confidence bound.

Suppose that we run algorithm 7.2 on the collision avoidance problem with m = 100 samples and observe no failures. Assuming we begin with a uniform prior, the posterior distribution over the probability of failure is Beta(1, 101). Suppose we are also given a safety requirement for the system stating that pfail must not exceed 0.01. We can compute P(pfail ≤ 0.01) from the cumulative distribution function of the beta distribution using the following code:

    using Distributions
    posterior = Beta(1, 101)
    confidence = cdf(posterior, 0.01)

The confidence variable is equal to 0.6376, indicating that we are 63.76 % confident that the true probability of failure is less than 0.01.
Suppose we instead want to determine a 95 % confidence bound on the probability of failure. We can compute this bound using the quantile function of the beta distribution as follows:

    bound = quantile(posterior, 0.95)

The bound variable is equal to 0.0292, so we can be 95 % confident that the true probability of failure is less than 0.0292. The plots show these results.


where the weights are wi = p(τi)/q(τi). These weights are sometimes referred to as importance weights. Trajectories that are more likely under the nominal trajectory distribution have higher importance weights.

Algorithm 7.3. The importance sampling estimation algorithm for estimating the probability of failure. The algorithm generates m samples from the proposal distribution q. It then computes the importance weights for the samples and applies equation (7.8) to compute p̂fail.

    struct ImportanceSamplingEstimation
        p # nominal distribution
        q # proposal distribution
        m # number of samples
    end

    function estimate(alg::ImportanceSamplingEstimation, sys, ψ)
        p, q, m = alg.p, alg.q, alg.m
        τs = [rollout(sys, q) for i in 1:m]
        ps = [pdf(p, τ) for τ in τs]
        qs = [pdf(q, τ) for τ in τs]
        ws = ps ./ qs
        return mean(w * isfailure(ψ, τ) for (w, τ) in zip(ws, τs))
    end

7.2.1 Optimal Proposal Distribution

The accuracy and efficiency of importance sampling approaches are highly dependent on the proposal distribution. The variance of the estimator in equation (7.8) is

    Var[p̂fail] = (1/m) E_{τ∼q(·)}[((p(τ)1{τ ∉ ψ} − q(τ)pfail) / q(τ))²]    (7.10)

In general, we want to select a proposal distribution that makes this variance low, and the optimal proposal distribution is the one that minimizes this variance. It is evident from equation (7.10) that we can achieve a variance of zero when

    q*(τ) = p(τ)1{τ ∉ ψ} / pfail    (7.11)

This distribution corresponds to the failure distribution p(τ | τ ∉ ψ). As noted in chapter 6, computing this distribution is not possible in practice since we often do not know the full set of failure trajectories, and the normalizing constant pfail is the quantity we are trying to estimate. Our goal is therefore to select a proposal distribution that is as close as possible to the failure distribution.


7.2.2 Proposal Distribution Selection

One way to select a proposal distribution for importance sampling is based on domain knowledge. For example, if we know that collisions between aircraft tend to occur more often when they have high vertical rates, we may select a proposal distribution that assigns higher likelihood to high vertical rates. It is important to ensure that the proposal distribution has adequate overlap with the failure distribution. In other words, it should assign high likelihood to likely failure trajectories. A poorly selected proposal distribution can lead to high variance and result in poor performance (see example 7.4).
We can also select a proposal distribution based on samples from the failure distribution. Specifically, we can approximate the failure distribution by fitting a distribution to samples obtained using the algorithms in chapter 6. The resulting distribution will approximate the optimal proposal distribution and may improve the performance over a hand-designed proposal distribution (see figure 7.2). The efficacy of this approach, however, is dependent on our ability to produce a good fit to the failure distribution, which may be difficult in high-dimensional spaces with multiple failure modes.

Figure 7.2. Fitting a proposal distribution qfit(τ) to samples from the failure distribution. The plot shows a histogram of samples from the failure distribution produced using rejection sampling. We fit a Gaussian distribution to these samples (red) using maximum likelihood estimation to use as a proposal distribution. The nominal distribution p(τ) is shown in black.

7.2.3 Multiple Importance Sampling

We can also draw samples from multiple proposal distributions and combine them to form a more robust estimate. This approach is known as multiple importance sampling (MIS). Suppose we draw m samples such that

    τi ∼ qi(·) for all i ∈ {1, …, m}    (7.12)

where qi(·) is the proposal distribution used to generate the ith sample τi. We can still use equation (7.9) to estimate the probability of failure for MIS, but we must modify the importance weights to account for multiple proposal distributions. Several weighting schemes will result in an unbiased estimate. Algorithm 7.4 implements multiple importance sampling with two different weighting schemes.⁴ The first weighting scheme is

    wi = p(τi) / qi(τi)    (7.13)

⁴ For a detailed discussion, see V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo, "Generalized Multiple Importance Sampling," Statistical Science, vol. 34, no. 1, pp. 129–155, 2019.

Example 7.4. Performance comparison of two hand-designed proposal distributions for the simple Gaussian problem where failures occur at values less than −2 (red region). The first plot shows the nominal distribution and two possible proposal distributions. The second plot shows the estimation error for direct estimation compared to the estimation error of importance sampling for the two distributions.

Consider the simple Gaussian problem shown below, where failures occur at values less than −2 (red shaded region). The plot on the left shows two proposal distributions we could use for importance sampling. The first proposal distribution q1 shifts the nominal distribution toward the failure region and assigns high likelihood to likely failure trajectories. The second proposal distribution q2 is shifted toward the failure region, but it still does not assign high likelihood to likely failures. Therefore, we expect q1 to result in better estimates than q2.
The plot on the right shows the estimation error when performing importance sampling with each proposal distribution compared to direct estimation. The shaded region represents the 90 % empirical confidence bounds on the error. As expected, q1 results in a lower estimation error and a lower variance than q2 and direct estimation. Performing importance sampling with q2 results in worse performance than direct estimation.

Algorithm 7.4. The multiple importance sampling algorithm for estimating the probability of failure. The algorithm generates a sample for each proposal distribution in qs. It then computes the importance weights using the weighting function provided and applies equation (7.9) to compute p̂fail. The smis and dmmis functions implement the s-MIS and DM-MIS weighting schemes, respectively.

    struct MultipleImportanceSamplingEstimation
        p         # nominal distribution
        qs        # proposal distributions
        weighting # weighting scheme: ws = weighting(p, qs, τs)
    end

    smis(p, qs, τs) = [pdf(p, τ) / pdf(q, τ) for (q, τ) in zip(qs, τs)]
    dmmis(p, qs, τs) = [pdf(p, τ) / mean(pdf(q, τ) for q in qs) for τ in τs]

    function estimate(alg::MultipleImportanceSamplingEstimation, sys, ψ)
        p, qs, weighting = alg.p, alg.qs, alg.weighting
        τs = [rollout(sys, q) for q in qs]
        ws = weighting(p, qs, τs)
        return mean(w * isfailure(ψ, τ) for (w, τ) in zip(ws, τs))
    end
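For reference, a hypothetical invocation of algorithm 7.4 with the deterministic mixture weighting might look as follows, assuming the proposal distributions in qs have already been constructed:

    alg = MultipleImportanceSamplingEstimation(p, qs, dmmis)
    p̂ = estimate(alg, sys, ψ)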

where the weight for each sample is computed using only the proposal that was used to generate it. This weighting scheme, which we refer to as standard MIS (s-MIS), is most similar to the importance sampling estimator for a single proposal distribution (equation (7.8)).
Instead of considering each proposal individually, we can also view the samples as if they were drawn in a deterministic order from a mixture distribution composed of all proposal distributions. This paradigm leads to the deterministic mixture weighting scheme (DM-MIS):

    wi = p(τi) / ((1/m) Σ_{j=1}^m qj(τi))    (7.14)

The denominator of equation (7.14) corresponds to the probability density of the mixture distribution evaluated at τi.
Figure 7.3 visualizes the weighting schemes, and example 7.5 compares their performance on a two-dimensional Gaussian problem. While both schemes are unbiased, DM-MIS has been shown to have lower variance than s-MIS. However, DM-MIS requires computing the likelihood of each sample under all proposal distributions, which may be computationally expensive if the number of proposal distributions is large.


Figure 7.3. Visualization of the two most common multiple importance sampling (MIS) weighting schemes. In this case, we have two proposal distributions, q1 and q2, and we want to determine the importance weight for τ2, which was sampled from q2. In s-MIS, we consider only q2, while in DM-MIS, we consider the mixture distribution qmix of q1 and q2.

7.3 Adaptive Importance Sampling

Adaptive importance sampling algorithms automatically tune a proposal or set of proposals to help alleviate the challenge of designing an effective set of proposals by hand. These algorithms use samples to iteratively adapt the proposal distributions to move toward the failure distribution. In this section, we present two common adaptive importance sampling algorithms: the cross entropy method and population Monte Carlo. The cross entropy method adapts a single proposal distribution, while population Monte Carlo adapts a set of proposal distributions.

7.3.1 Cross Entropy Method

The cross entropy method iteratively fits a proposal distribution using samples.⁵ The algorithm requires selecting a form for the proposal distribution that is described by a set of parameters. A common choice is a multivariate Gaussian distribution, which is parameterized by a mean vector and a covariance matrix. The goal of the cross entropy method is to find the set of parameters that minimizes the cross entropy between the proposal distribution and the failure distribution. Cross entropy is an idea used in information theory that provides a measure of distance between two probability distributions.
We can determine an approximate solution to this minimization problem using samples drawn from an initial proposal distribution q. For many common distribution types, minimizing the cross entropy is equivalent to computing the weighted maximum likelihood estimate with weights based on the proposal density and failure density.⁶

⁵ For a detailed overview, see P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, "A Tutorial on the Cross-Entropy Method," Annals of Operations Research, vol. 134, pp. 19–67, 2005.

⁶ Minimizing the cross entropy is equivalent to weighted maximum likelihood estimation for distributions in the natural exponential family. The natural exponential family includes many common distributions such as the Gaussian, geometric, exponential, categorical, and Beta distributions.

Example 7.5. Performance comparison of importance sampling to multiple importance sampling for a two-dimensional Gaussian problem. The plots show a single proposal distribution that can be used for importance sampling on this problem, a set of proposal distributions that can be used for MIS, and the estimation error for IS compared to the estimation error of MIS with the two different weighting schemes.

Suppose we want to estimate the probability of failure for the two-dimensional Gaussian system shown below. The nominal distribution is a multivariate Gaussian distribution with a mean at the center of the figure, and the failure region is composed of the two shaded red regions. The plots show the log density of both the nominal distribution and the failure distribution.
Most of the probability mass for the failure distribution is concentrated in the central corners of the two modes, and a good proposal distribution should assign high likelihood in those areas. If we only use one multivariate Gaussian proposal distribution, we need to select a wide distribution to ensure that it covers both failure modes (left). We can improve performance by selecting multiple proposal distributions that together cover both failure modes (right). The error plot compares the performance of importance sampling (IS) to multiple importance sampling with the two different weighting schemes. The shaded region represents the 90 % empirical confidence bounds on the error. The DM-MIS weighting scheme results in better performance than the s-MIS weighting scheme.

In these cases, the weight for a given sample τ drawn from the distribution q is

    w = 1{τ ∉ ψ} p(τ) / q(τ)    (7.15)

where p is the nominal trajectory distribution.
If failures are rare under the initial proposal distribution, it is possible that no samples will be failures, and the weights computed in equation (7.15) will all be zero. To address this challenge, the cross entropy algorithm iteratively solves a relaxed version of the problem that relies on an objective function f. Similar to the objective functions introduced in section 4.5, the objective function should assess how close a trajectory is to a failure.⁷ The objective value must be greater than zero for trajectories that are not failures and less than or equal to zero for failure trajectories. For systems with temporal logic specifications, we can use the robustness as the objective function.
We can rewrite the goal of the cross entropy method in terms of the objective function as finding the set of parameters that minimizes the cross entropy between the proposal distribution and p(τ | f(τ) ≤ 0). For systems with rare failure events, we gradually make progress toward this goal by solving a series of relaxed problems where we instead minimize the cross entropy between the proposal and p(τ | f(τ) ≤ γ) for a given threshold γ > 0. The weights used in maximum likelihood estimation for the relaxed problem are

    w = 1{f(τ) ≤ γ} p(τ) / q(τ)    (7.16)

At each iteration, we select the threshold γ based on our current set of samples to ensure that a fraction of the weights will be nonzero (figure 7.4).
Algorithm 7.5 implements the cross entropy method. At each iteration, we draw samples from the current proposal distribution and compute their objective values. We then select the threshold γ as the highest objective value from a set of elite samples. The elite samples are the m_elite samples with the lowest objective values. Since our ultimate goal is to approach the failure distribution, we ensure that the threshold does not become negative by clipping it at zero. Given this threshold, we compute the weights using equation (7.16) and fit a new proposal distribution to the samples. After repeating this process for a fixed number of iterations, algorithm 7.5 performs importance sampling (algorithm 7.3) with the final proposal distribution to produce an estimate of the probability of failure.

⁷ For example, an objective function for the aircraft collision avoidance problem might output the miss distance between the two aircraft.

Figure 7.4. Threshold selection for a two-dimensional Gaussian problem with two failure modes. The red shaded region shows the failure region. None of the current samples overlap with the failure region, so we relax the problem by expanding to the blue region that contains a desired fraction of the samples. The blue samples are the top 10 % of samples with the lowest objective values.

Algorithm 7.5. The cross entropy method for estimating the probability of failure. At each iteration, the algorithm draws trajectory samples from the current proposal distribution and computes their objective values. It then sorts the samples by objective value and uses the m_elite samples with the lowest objective values to compute a threshold value. Using the threshold, the algorithm computes the weights and fits a new proposal distribution to the samples. The fit function is specific to the type of proposal distribution used and should perform weighted maximum likelihood estimation. After k_max iterations, the algorithm calls algorithm 7.3 to produce an estimate of the probability of failure using the final proposal distribution.

    struct CrossEntropyEstimation
        p       # nominal trajectory distribution
        q₀      # initial proposal distribution
        f       # objective function f(τ, ψ)
        k_max   # number of iterations
        m       # number of samples per iteration
        m_elite # number of elite samples
    end

    function estimate(alg::CrossEntropyEstimation, sys, ψ)
        k_max, m, m_elite = alg.k_max, alg.m, alg.m_elite
        p, q, f = alg.p, alg.q₀, alg.f
        for k in 1:k_max
            τs = [rollout(sys, q) for i in 1:m]
            Y = [f(τ, ψ) for τ in τs]
            order = sortperm(Y)
            γ = max(0, Y[order[m_elite]])
            ps = [pdf(p, τ) for τ in τs]
            qs = [pdf(q, τ) for τ in τs]
            ws = ps ./ qs
            ws[Y .> γ] .= 0
            q = fit(typeof(q), τs, ws=ws)
        end
        return estimate(ImportanceSamplingEstimation(p, q, m), sys, ψ)
    end
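As an illustration of the weighted fit step, the following one-dimensional sketch uses the weighted maximum likelihood fit provided by Distributions.jl; the samples xs and weights ws here are stand-ins for the trajectory samples and the weights from equation (7.16):

    using Distributions
    xs = randn(100)                      # stand-in scalar samples from q
    ws = [x ≤ 0 ? 1.0 : 0.0 for x in xs] # stand-in weights, e.g. 1{f(τ) ≤ γ}p/q
    q_new = fit_mle(Normal, xs, ws)      # weighted maximum likelihood fit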

Figure 7.5. Visualization of the cross entropy method at iterations 1, 2, 5, and 20 on a problem with a single failure mode (top row) and two failure modes (bottom row). The proposal distribution takes the form of a multivariate Gaussian distribution, and the blue samples are the elite samples for each iteration. For a single failure mode, the threshold reaches the failure region and the proposal is able to approximate the failure distribution. For two failure modes, the proposal distribution must become wide to capture both failure modes, and the algorithm does not perform as well.


Figure 7.6 demonstrates the progression of the algorithm on a simple problem.
Some implementations of the cross entropy method increase efficiency by using the samples produced across all iterations of the algorithm to estimate the probability of failure. They keep track of the proposal distribution used to generate the samples at each iteration and view the problem as an instance of MIS. They produce an estimate using the weighting schemes from section 7.2.3. It is important to note in this case, however, that the proposal for each iteration depends on the previous proposal. Since the proposals are not independent from one another, the DM-MIS weighting scheme will no longer be unbiased.
The performance of the cross entropy algorithm is sensitive to the form of the proposal distribution. The algorithm may perform poorly if the proposal distribution is not expressive enough to adequately capture the shape of the failure distribution. This behavior is particularly apparent for complex systems with high-dimensional, multimodal failure distributions. For example, if we select a Gaussian proposal distribution for a system with two failure modes, the algorithm will struggle to find a proposal distribution that captures both failure modes (figure 7.5). One solution is to use a mixture of Gaussians for multimodal failure distributions (figure 7.7), but this approach requires knowing the number of failure modes in advance, which is often not possible in practice. In these cases, an adaptive MIS approach such as population Monte Carlo may perform better.

Figure 7.6. The cross entropy method for a one-dimensional Gaussian problem. The plot shows the Gaussian proposal distribution at each iteration of the algorithm, with darker distributions representing later iterations. The distributions start at the nominal distribution p(τ) and gradually move toward the failure distribution p(τ | τ ∉ ψ) (red).
7.3.2 Population Monte Carlo

Population Monte Carlo (PMC) is an adaptive MIS algorithm that maintains a set, or population, of proposal distributions (algorithm 7.6).⁸ Figure 7.8 shows a single step of the algorithm. We begin with an initial population of m proposals that is spread across the space of proposal distributions. For example, we could use a set of multivariate Gaussian distributions with a fixed covariance and different means. It is important to ensure that the initial population is sufficiently diverse to capture all failure modes.
At each iteration, the algorithm draws a single sample from each proposal distribution in the population. It then computes a weight for each sample in the same way the weights are computed for the cross entropy method (equation (7.15)). Samples in regions of high likelihood under the failure distribution will receive higher weights. To adapt the proposal distributions, PMC uses the weights to perform a resampling step.

⁸ O. Cappé, A. Guillin, J.-M. Marin, and C. P. Robert, "Population Monte Carlo," Journal of Computational and Graphical Statistics, vol. 13, no. 4, pp. 907–929, 2004.

Figure 7.7. Example of a Gaussian mixture model proposal distribution for a one-dimensional problem with two failure modes. The proposal distribution is a mixture of two Gaussians (blue) that approximates the multimodal failure distribution (red).


Algorithm 7.6. The population Monte Carlo algorithm for estimating the probability of failure. At each iteration, the algorithm draws trajectory samples from the proposal distributions in the population, computes their weights, and resamples to produce new proposal distributions. The proposal function is specific to the trajectory distribution and creates a proposal distribution from a sample. After k_max iterations, the algorithm calls algorithm 7.4 using the specified weighting scheme to produce an estimate of the probability of failure using the final set of proposal distributions.

    struct PopulationMonteCarloEstimation
        p         # nominal trajectory distribution
        qs        # vector of initial proposal distributions
        weighting # weighting scheme: ws = weighting(p, qs, τs)
        k_max     # number of iterations
    end

    function estimate(alg::PopulationMonteCarloEstimation, sys, ψ)
        p, qs, weighting = alg.p, alg.qs, alg.weighting
        k_max, m = alg.k_max, length(qs)
        for k in 1:k_max
            τs = [rollout(sys, q) for q in qs]
            ws = [pdf(p, τ) * isfailure(ψ, τ) / pdf(q, τ)
                  for (q, τ) in zip(qs, τs)]
            resampler = Categorical(ws ./ sum(ws))
            qs = [proposal(qs[i], τs[i]) for i in rand(resampler, m)]
        end
        mis = MultipleImportanceSamplingEstimation(p, qs, weighting)
        return estimate(mis, sys, ψ)
    end

Figure 7.8. One iteration of the population Monte Carlo algorithm: initial proposals, sampling, weighting, and resampling. For the weighting step, gray samples have zero weight, and the size of the blue samples is proportional to their weight. The resampling step produces new proposal distributions that are centered in high likelihood regions of the failure distribution.


In this step, we redraw m samples from the population of samples with probability proportional to their weights. We then reconstruct the population of proposal distributions using the resulting samples. For example, if we are using proposals in the form of multivariate Gaussian distributions, we could create new proposals with the same fixed covariance and means centered at each sample.
Over time, the population of proposal distributions should cover high likelihood regions of the failure distribution. After a fixed number of iterations, we perform MIS using the final population to estimate the probability of failure. We can use either of the weighting schemes from section 7.2.3 to produce the estimate. Similar to the cross entropy method, we could instead use the samples produced during all iterations of the algorithm to estimate the probability of failure, noting that the estimate in this case may no longer be unbiased.
Using multiple proposal distributions allows us to represent complex, multimodal failure distributions. However, the performance of PMC is still dependent on the number of proposal distributions and their ability to cover the space of possible proposals. If the number of proposal distributions is too small or the initial proposals are not sufficiently diverse, the algorithm may miss failure modes and produce an inaccurate estimate. Furthermore, the stochastic nature of the resampling procedure can lead to a loss of diversity in the proposal distributions over time. For example, the proposals may collapse to a single failure mode or a subset of the failure modes.

7.4 Sequential Monte Carlo

The sampling, weighting, and resampling components of algorithm 7.6 form the
basis for a more general framework used in the field of Bayesian inference called
sequential Monte Carlo (SMC).9 In SMC, we start with samples from the nominal 9
SMC is also known as particle fil-
trajectory distribution and gradually adapt these samples to move toward the tering in the context of state es-
timation. M. S. Arulampalam, S.
failure distribution. We then use the path of each sample to estimate the probability Maskell, N. Gordon, and T. Clapp,
of failure. “A Tutorial on Particle Filters for
Online Nonlinear/non-Gaussian
One way to adapt the samples in SMC is to move them through a sequence of Bayesian Tracking,” IEEE Transac-
intermediate distributions that gradually transition from the nominal distribu- tions on Signal Processing, vol. 50,
tion to the failure distribution. Specifically, we create a sequence of distributions no. 2, pp. 174–188, 2002.

g1 , g2 , . . . , gn where g1 is the nominal trajectory distribution and gn is the failure


distribution. Figure 7.9 illustrates two methods for selecting the intermediate

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
126 chap ter 7. failure probability estim ation

Smoothing Thresholding Figure 7.9. Two methods for select-


ing intermediate distributions for
SMC. The distributions gradually
p(t ) p(t ) transition from the nominal trajec-
tory distribution p(t ) (blue) to the
failure distribution p(t | t 2 / y)
(red). The first method uses the
/ y)
p(t | t 2 / y)
p(t | t 2 smoothing technique introduced
in section 6.3.2. The second method
uses the thresholding technique in-
t t troduced in section 7.3.1 with ob-
jective function f (t ) = t + 2.

distributions. The first method uses the smoothing technique introduced in sec-
tion 6.3.2. We can move from the nominal distribution to the failure distribution by
gradually decreasing the value of the standard deviation e in the smoothed den-
sity.10 The second method uses the same thresholding technique used in the cross 10
A similar technique is the expo-
entropy method. The intermediate distributions take the form p(t | f (t )  g) nential tilting barrier presented in
A. Sinha, M. O’Kelly, R. Tedrake,
where f (t ) is the objective function and g is a threshold. We move from the and J. C. Duchi, “Neural Bridge
nominal distribution to the failure distribution by gradually decreasing the value Sampling for Evaluating Safety-
Critical Autonomous Systems,” Ad-
of g. vances in Neural Information Pro-
At each iteration of SMC, our goal is to transition samples from the current cessing Systems (NeurIPS), vol. 33,
distribution g` to the next distribution in the sequence g`+1 . We typically only pp. 6402–6416, 2020.

have access to the unnormalized densities of the intermediate distributions in


practice, so MCMC is commonly used to perform this transition. Specifically, we
initialize an MCMC chain at each sample with g`+1 as the target distribution and
run the chain for a fixed number of iterations. We take the final sample in each
chain to form the next set of samples. Figure 7.10 demonstrates this process.
To produce an estimate of the probability of failure from the MCMC samples, we derive a set of importance weights using the joint probability distribution over the path of each sample as the proposal distribution.11 The importance weight of the ith trajectory after sampling from the distribution g_ℓ is given by

    w_i^{(ℓ)} = w_i^{(ℓ−1)} · ḡ_{ℓ+1}(τ_i^{(ℓ)}) / ḡ_ℓ(τ_i^{(ℓ)})        (7.17)

11. A full derivation can be found in F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, “Marginal Likelihood Computation for Model Selection and Hypothesis Testing: an Extensive Review,” SIAM Review, vol. 65, no. 1, pp. 3–58, 2023.

where ḡ_ℓ(τ) is equal to p(τ) when ℓ = 1 and the unnormalized density of the ℓth intermediate distribution otherwise. We can obtain an estimate for the probability


Figure 7.10. Adaptation steps in SMC. The plot on the left shows the MCMC paths of a single sample as it transitions from the nominal distribution (blue) to the failure distribution through a set of smoothed intermediate distributions. The plot on the right shows this process on a set of samples initially drawn from the nominal distribution.

of failure using the mean of the weights at the final iteration:

    p̂fail = (1/m) ∑_{i=1}^m w_i^{(n−1)}        (7.18)

The accuracy of the estimator in equation (7.18) depends on how well the samples at each iteration represent the corresponding intermediate distribution. If the samples are not representative, the weights will be small, and the estimator will be inaccurate. However, we may require a large number of MCMC steps to transition samples from one distribution to the next, especially for samples that are unlikely under the next distribution. One technique used to address this challenge is to resample the trajectories based on their importance weights.12 This step is similar to the resampling step in PMC and tends to result in better coverage of the intermediate distributions (see example 7.6). After resampling, we reset the weights to the mean of the weights before resampling to ensure that the estimator in equation (7.18) remains accurate.

12. P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo Samplers,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 68, no. 3, pp. 411–436, 2006.
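The weighting, resampling, and weight-reset steps can be written compactly. The following is a minimal sketch under the same assumptions as algorithm 7.7 (ḡ and ḡ′ are unnormalized densities, and Categorical comes from Distributions.jl):

using Distributions, Statistics

function reweight_and_resample(τs, ws, ḡ, ḡ′)
    ws = ws .* [ḡ′(τ) / ḡ(τ) for τ in τs]              # weight update, equation (7.17)
    idx = rand(Categorical(ws ./ sum(ws)), length(τs))  # resample in proportion to weight
    return τs[idx], fill(mean(ws), length(ws))          # reset all weights to the mean
end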
Algorithm 7.7 implements SMC given a nominal trajectory distribution and a set of intermediate distributions. At each iteration, it perturbs the current set of samples to represent the next distribution in the sequence using MCMC. Example 7.7 provides an implementation of this step for the inverted pendulum problem. The algorithm then updates the importance weights and performs the resampling step. Finally, it returns an estimate of the probability of failure based on equation (7.18).


Example 7.6. The benefit of resampling in SMC. Consider the scenario shown below in which we want to transition samples from the blue distribution to the purple distribution using 10 MCMC steps per sample. The plots below illustrate the weighting and resampling steps. The plot in the middle shows the weights of the samples, with darker points having higher weights. The plot on the right shows the samples after resampling according to these weights. After resampling, the samples are more representative of the purple distribution.

(Plots: the Sample, Weight, and Resample steps in the (τ₁, τ₂) plane.)

On the next iteration, we perform MCMC starting at these samples with the purple distribution as the target to complete the transition. The plots below show the result of this step with and without resampling. The results without resampling start the MCMC chains at the blue samples shown above. The resampling step results in a set of samples that better represents the target distribution.

(Plots: the next iteration without resampling and with resampling.)


Algorithm 7.7. The sequential Monte Carlo algorithm for estimating the probability of failure. The algorithm iterates through intermediate distributions and perturbs the samples to represent the current distribution at each iteration. It then computes the weights according to the next distribution in the sequence, performs the resampling step, and resets the weights. The algorithm uses a system-specific perturb function to transition samples from one distribution to the next. It returns an estimate of the probability of failure based on equation (7.18).

struct SequentialMonteCarloEstimation
    p       # nominal trajectory distribution
    ḡs      # intermediate distributions
    perturb # τs′ = perturb(τs, ḡ)
    m       # number of samples
end

function estimate(alg::SequentialMonteCarloEstimation, sys, ψ)
    p, ḡs, perturb, m = alg.p, alg.ḡs, alg.perturb, alg.m
    p̄failure(τ) = isfailure(ψ, τ) * pdf(p, τ)
    τs = [rollout(sys, p) for i in 1:m]
    ws = [ḡs[1](τ) / pdf(p, τ) for τ in τs]
    for (ḡ, ḡ′) in zip(ḡs, [ḡs[2:end]...; p̄failure])
        τs′ = perturb(τs, ḡ)
        ws .*= [ḡ′(τ) / ḡ(τ) for τ in τs′]
        τs = τs′[rand(Categorical(ws ./ sum(ws)), m)]
        ws .= mean(ws)
    end
    return mean(ws)
end

Unlike the algorithms presented in section 7.3, algorithm 7.7 is nonparametric. We do not need to specify a parametric form for the intermediate distributions. Instead, we represent them using samples. This flexibility allows us to estimate the probability of failure for complex, multimodal failure distributions. However, SMC can run into the same potential problems as PMC. For example, if the MCMC is not run for long enough on each iteration, the samples may be inaccurate and miss potential failure modes. The resampling step can also cause a loss of diversity that may result in a collapse of the samples to a single mode. Other weighting and resampling schemes may be used to maintain diversity.13

13. One common resampling scheme that ensures that the samples remain diverse is called low variance resampling. More details can be found in section 4.2.4 of S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press, 2006.

7.5 Ratio of Normalizing Constants

Importance sampling is a special case of the more general problem of estimating the ratio of the normalizing constants of two distributions.14 By focusing on the more general problem, we can derive multiple extensions to importance sampling that allow us to use unnormalized proposal densities. Consider two probability distributions g₁ and g₂ with normalizing constants z₁ and z₂ such that g₁(τ) = ḡ₁(τ)/z₁ and g₂(τ) = ḡ₂(τ)/z₂. We have that z₁ = ∫ ḡ₁(τ) dτ and

14. A detailed survey of techniques for estimating the ratio of normalizing constants is provided in F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, “Marginal Likelihood Computation for Model Selection and Hypothesis Testing: an Extensive Review,” SIAM Review, vol. 65, no. 1, pp. 3–58, 2023.


Example 7.7. Application of SMC to the inverted pendulum problem. To estimate the probability of failure for the inverted pendulum system using SMC, we implement the following function that uses 10 MCMC steps to transition samples between intermediate distributions:

function perturb(samples, ḡ)
    function inverted_pendulum_kernel(τ; Σ=0.05^2 * I)
        μs, Σs = [step.x.xo for step in τ], [Σ for step in τ]
        return PendulumTrajectoryDistribution(τ[1].s, Σ, μs, Σs)
    end
    k_max, m_burnin, m_skip, new_samples = 10, 1, 1, []
    for sample in samples
        alg = MCMCSampling(ḡ, inverted_pendulum_kernel, sample,
                           k_max, m_burnin, m_skip)
        mcmc_samples = sample_failures(alg, inverted_pendulum, ψ)
        push!(new_samples, mcmc_samples[end])
    end
    return new_samples
end

We can use a set of smoothed failure distributions as the intermediate distributions. The plots show samples from these intermediate distributions for ε = 0.5, 0.2, 0.1, and 0.01, plotting the pendulum angle θ (rad) against time. Using 1,000 samples per iteration, SMC estimates the probability of failure to be approximately 0.0001. The direct estimate for the probability of failure based on one million simulations is approximately 0.0005.


z₂ = ∫ ḡ₂(τ) dτ, and our goal is to estimate the ratio of the normalizing constants z₁/z₂ using samples from g₂.

First, we rewrite z₁ in terms of an expectation over g₂:

    z₁ = ∫ ḡ₁(τ) dτ                                   (7.19)
       = ∫ ḡ₁(τ) (g₂(τ)/g₂(τ)) dτ                     (7.20)
       = ∫ g₂(τ) (ḡ₁(τ)/(ḡ₂(τ)/z₂)) dτ                (7.21)
       = z₂ ∫ g₂(τ) (ḡ₁(τ)/ḡ₂(τ)) dτ                  (7.22)
       = z₂ E_{τ∼g₂(·)}[ḡ₁(τ)/ḡ₂(τ)]                  (7.23)

Dividing both sides of equation (7.23) by z₂ gives us the ratio of the normalizing constants, which we can approximate using m samples from g₂:

    z₁/z₂ = E_{τ∼g₂(·)}[ḡ₁(τ)/ḡ₂(τ)] ≈ (1/m) ∑_{i=1}^m ḡ₁(τᵢ)/ḡ₂(τᵢ)        (7.24)

where τᵢ ∼ g₂(·) and g₂(τ) > 0 whenever g₁(τ) > 0. Note that the estimator in equation (7.24) only requires evaluating the unnormalized densities ḡ₁(τ) and ḡ₂(τ). Since pfail is the normalizing constant of the failure distribution, we can use equation (7.24) to estimate the probability of failure by setting ḡ₁(τ) equal to the unnormalized failure density and ḡ₂(τ) equal to any normalized proposal density q(τ). In fact, these choices of ḡ₁(τ) and ḡ₂(τ) cause equation (7.24) to reduce to the importance sampling estimator in equation (7.8) (see exercises 7.2 and 7.3).15

15. We could also use equation (7.24) to estimate the reciprocal of the probability of failure using samples from the failure distribution by setting ḡ₂(τ) equal to the unnormalized failure density and ḡ₁(τ) equal to any normalized density whose support is contained within the support of the failure distribution. However, selecting ḡ₁(τ) to satisfy this condition is often difficult in practice and can lead to estimators with infinite variance. This technique is called reciprocal importance sampling. In general, this estimator should not be used for failure probability estimation.
sampling introduces a third density, called an umbrella density, that has signif- importance sampling. In general,
this estimator should not be used
icant overlap with both g1 and g2 . We use this density to estimate the ratio of for failure probability estimation.
normalizing constants by applying equation (7.24) twice:
h i
ḡ1 (t ) 1 m ḡ1 (ti )
m Âi =1 ḡu (ti )
z1 z1 /zu E t ⇠ gu (·) ḡu (t )
= = h i ⇡ (7.25)
z2 z2 /zu E
ḡ2 (t ) 1
Âm 2 i
ḡ (t )
t ⇠ gu (·) ḡu (t ) m i =1 ḡu (ti )


where ḡᵤ is the unnormalized umbrella density, zᵤ is its normalizing constant, and the m samples are drawn from gᵤ(·). The optimal umbrella density is

    ḡᵤ*(τ) ∝ |ḡ₁(τ) − (z₁/z₂) ḡ₂(τ)|        (7.26)

Similar to the optimal proposal for importance sampling, the optimal umbrella density is expressed in terms of the quantity we are trying to estimate, so we cannot compute it exactly. In general, we want to select an umbrella density that is as close as possible to this density.
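The umbrella estimator itself is straightforward once samples from the umbrella distribution are available. The following is a minimal sketch of equation (7.25), assuming ḡ₁, ḡ₂, and ḡᵤ are unnormalized density functions and τs holds samples drawn from gᵤ (for example, with the MCMC methods of chapter 6):

using Statistics

function umbrella_sampling_estimator(τs, ḡ₁, ḡ₂, ḡᵤ)
    num = mean(ḡ₁(τ) / ḡᵤ(τ) for τ in τs)  # estimates z₁/zᵤ
    den = mean(ḡ₂(τ) / ḡᵤ(τ) for τ in τs)  # estimates z₂/zᵤ
    return num / den                        # estimates z₁/z₂
end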
Another technique to estimate the ratio of normalizing constants when g₁ and g₂ have little overlap is called bridge sampling. Similar to umbrella sampling, bridge sampling introduces a third density called a bridge density. However, instead of using samples from this density to estimate the ratio of normalizing constants, bridge sampling uses samples from both g₁ and g₂. Assuming we produce m₁ samples from g₁ and m₂ samples from g₂, we again apply equation (7.24) twice to obtain the bridge sampling estimator:

    z₁/z₂ = (z_b/z₂)/(z_b/z₁) = E_{τ∼g₂(·)}[ḡ_b(τ)/ḡ₂(τ)] / E_{τ∼g₁(·)}[ḡ_b(τ)/ḡ₁(τ)]
          ≈ ((1/m₂) ∑_{j=1}^{m₂} ḡ_b(τⱼ)/ḡ₂(τⱼ)) / ((1/m₁) ∑_{i=1}^{m₁} ḡ_b(τᵢ)/ḡ₁(τᵢ))        (7.27)

where τᵢ ∼ g₁(·), τⱼ ∼ g₂(·), and ḡ_b is the bridge density.


The optimal bridge density is

    ḡ_b*(τ) ∝ ḡ₁(τ) ḡ₂(τ) / (m₁ ḡ₁(τ) + m₂ (z₁/z₂) ḡ₂(τ))        (7.28)

which is again written in terms of the quantity we are trying to estimate. Given samples from both g₁ and g₂, we can use a simple iterative procedure to estimate the optimal bridge density (algorithm 7.8). At each iteration, we apply equation (7.27) using the current bridge density to estimate the ratio of normalizing constants. We then plug this ratio into equation (7.28) to obtain a new bridge density. We repeat this process for a fixed number of iterations.
While umbrella sampling and bridge sampling both introduce a third density to
improve efficiency, they have different properties. For example, umbrella sampling
only requires samples from one density, while bridge sampling requires samples
from two different densities. Furthermore, the optimal umbrella density and the
optimal bridge density are very different (see figure 7.11). The optimal umbrella


Algorithm 7.8. Algorithm for estimating the optimal bridge density ḡb using samples from ḡ₁ and ḡ₂. We iteratively apply equation (7.27) to estimate the ratio of normalizing constants and use this ratio to update the bridge density using equation (7.28).

function bridge_sampling_estimator(g₁τs, ḡ₁, g₂τs, ḡ₂, ḡb)
    ḡ₁s, ḡ₂s = ḡ₁.(g₁τs), ḡ₂.(g₂τs)
    ḡb₁s, ḡb₂s = ḡb.(g₁τs), ḡb.(g₂τs)
    return mean(ḡb₂s ./ ḡ₂s) / mean(ḡb₁s ./ ḡ₁s)
end

function optimal_bridge(g₁τs, ḡ₁, g₂τs, ḡ₂, k_max)
    ratio = 1.0
    m₁, m₂ = length(g₁τs), length(g₂τs)
    ḡb(τ) = (ḡ₁(τ) * ḡ₂(τ)) / (m₁ * ḡ₁(τ) + m₂ * ratio * ḡ₂(τ))
    for k in 1:k_max  # iterate for k_max refinements of the ratio estimate
        ratio = bridge_sampling_estimator(g₁τs, ḡ₁, g₂τs, ḡ₂, ḡb)
    end
    return ḡb
end

Figure 7.11. Comparison of the optimal umbrella density ḡᵤ*(τ) and the optimal bridge density ḡ_b*(τ) for estimating the ratio of normalizing constants between two example distributions ḡ₁(τ) and ḡ₂(τ).

density covers regions of high likelihood for both distributions, while the optimal
bridge density bridges the gap between the two distributions.

7.5.1 Self-Normalized Importance Sampling


Self-normalized importance sampling (self-IS) is a special case of umbrella sampling that can be used to estimate the probability of failure given samples from an unnormalized density. Specifically, we set ḡ₁(τ) to be the unnormalized failure density, ḡ₂(τ) to be the nominal trajectory distribution, and ḡᵤ(τ) to be an unnormalized

proposal density q̄(τ). These choices lead to the following estimator:

    pfail ≈ ((1/m) ∑_{i=1}^m 1{τᵢ ∉ ψ} p(τᵢ)/q̄(τᵢ)) / ((1/m) ∑_{i=1}^m p(τᵢ)/q̄(τᵢ))
          = ∑_{i=1}^m wᵢ 1{τᵢ ∉ ψ} / ∑_{i=1}^m wᵢ        (7.29)

where wᵢ = p(τᵢ)/q̄(τᵢ) and τᵢ ∼ q(·). This estimator (algorithm 7.9) is similar to the estimator in equation (7.9) for normalized proposal distributions with the extra step of dividing the unnormalized importance weights by their sum.

Algorithm 7.9. The self-normalized importance sampling estimation algorithm for estimating the probability of failure. The algorithm takes as input an unnormalized proposal density q̄ along with samples drawn from it. It computes the importance weights for the samples and applies equation (7.29) to compute p̂fail.

struct SelfImportanceSamplingEstimation
    p    # nominal distribution
    q̄    # unnormalized proposal density
    q̄_τs # samples from q̄
end

function estimate(alg::SelfImportanceSamplingEstimation, sys, ψ)
    p, q̄, q̄_τs = alg.p, alg.q̄, alg.q̄_τs
    ws = [pdf(p, τ) / q̄(τ) for τ in q̄_τs]
    ws ./= sum(ws)  # self-normalize the importance weights
    # sum of normalized weights over failure samples, per equation (7.29)
    return sum(w * isfailure(ψ, τ) for (w, τ) in zip(ws, q̄_τs))
end
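To see that the unknown normalizing constant of q̄ cancels, here is a self-contained toy check of equation (7.29) on a simple Gaussian problem (the failure threshold of 3 and the factor of 2 are illustrative assumptions, not from the text):

using Distributions

p = Normal(0, 1)                 # nominal distribution
q = Normal(3, 1)                 # proposal centered near the failure region
q̄(τ) = 2pdf(q, τ)                # pretend we only know q up to a constant
τs = rand(q, 10_000)
ws = [pdf(p, τ) / q̄(τ) for τ in τs]
ws ./= sum(ws)
p̂fail = sum(w * (τ ≥ 3) for (w, τ) in zip(ws, τs))  # close to ccdf(p, 3) ≈ 0.00135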

The optimal proposal for self-IS is different from the optimal proposal for importance sampling. Based on equation (7.26), the optimal proposal for self-IS is

    q*(τ) ∝ p(τ) |1{τ ∉ ψ} − pfail|        (7.30)

Sampling from this density should result in half of the samples coming from the failure distribution and half coming from the success distribution. The optimal proposal for IS, on the other hand, is the failure distribution itself. Figure 7.12 shows the optimal proposal distribution for self-IS on a simple Gaussian system. In practice, we can plug a guess for pfail into equation (7.30) to obtain a proposal distribution that is close to the optimal proposal. However, drawing samples from this proposal is often difficult in practice, especially for systems with rare failure events and multiple failure modes (see example 7.8). Furthermore, the performance of the algorithm tends to be sensitive to incorrect guesses for pfail when creating the proposal distribution. Bridge sampling, which we discuss in the next section, is less sensitive to these choices.

Figure 7.12. The optimal proposal q*(τ) for self-normalized importance sampling for a simple Gaussian problem with a failure threshold of 1, shown alongside the nominal density p(τ) and the unnormalized failure density p̄(τ | τ ∉ ψ).


Example 7.8. The challenges associated with sampling from the optimal self-IS proposal density. Suppose we want to use self-IS to estimate the probability of failure for the simple Gaussian system. We know that the optimal proposal is of the form

    q*(τ) ∝ p(τ) |1{τ ∉ ψ} − α|

where α is our guess for the probability of failure. The plots below show the proposal distribution for three different values of α (10⁻², 10⁻⁷, and 10⁻¹²) along with histograms of samples drawn from these distributions using MCMC. For each distribution, we use 5,100 MCMC steps with a burn-in of 100 steps, keeping every 10th sample.

(Plots: proposal densities and sample histograms for α = 10⁻², 10⁻⁷, and 10⁻¹².)

As α decreases, the modes of the proposal distribution grow further apart, and the proposal distribution becomes more difficult to accurately sample from. For α = 10⁻¹², the MCMC misses the failure region entirely.


7.5.2 Bridge Sampling


We can use the bridge sampling estimator in equation (7.27) to estimate the probability of failure, but it will be inefficient for systems with low failure probabilities. In fact, if we follow the same steps we followed to derive the self-IS estimator, we will arrive at an estimator that can perform no better than the direct estimator in section 7.1. This property of the bridge sampling estimator is a result of the optimal bridge density being zero for all samples that are not failures (see example 7.9).

To improve performance, we can build upon ideas from SMC (section 7.4) by performing bridge sampling on a sequence of distributions that gradually transition from the nominal distribution to the failure distribution.16 Specifically, we create a sequence of n distributions and represent the ℓth distribution as g_ℓ(τ) = ḡ_ℓ(τ)/z_ℓ. We set ḡ₁(τ) equal to the density of the nominal trajectory distribution and ḡₙ(τ) equal to the unnormalized failure density.

16. A. Sinha, M. O’Kelly, R. Tedrake, and J. C. Duchi, “Neural Bridge Sampling for Evaluating Safety-Critical Autonomous Systems,” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6402–6416, 2020.
We then rewrite the probability of failure as a product of ratios of normalizing constants:

    pfail = zₙ/z₁ = (z₂/z₁)(z₃/z₂) ··· (zₙ/z_{n−1}) = ∏_{ℓ=1}^{n−1} z_{ℓ+1}/z_ℓ        (7.31)

These ratios can be estimated using the bridge sampling identity in equation (7.27):

    pfail ≈ ∏_{ℓ=1}^{n−1} ((1/m₂) ∑_{j=1}^{m₂} ḡ_b(τⱼ)/ḡ_ℓ(τⱼ)) / ((1/m₁) ∑_{i=1}^{m₁} ḡ_b(τᵢ)/ḡ_{ℓ+1}(τᵢ))        (7.32)

where we draw m₁ samples from g_{ℓ+1}(·) and m₂ samples from g_ℓ(·). The intermediate distributions in the chain should be chosen such that the ratio of normalizing constants between two consecutive distributions is easy to estimate. In other words, consecutive intermediate distributions should have significant overlap with each other. For example, we can create the intermediate distributions using either of the two methods in figure 7.9.17

17. If we use the thresholding technique, the algorithm reduces to the multilevel splitting algorithm presented in section 7.6.

Algorithm 7.10 implements bridge sampling estimation using a sequence of intermediate distributions. It begins by drawing samples from the nominal trajectory distribution. At each iteration, it perturbs the samples to match the next intermediate distribution and estimates the optimal bridge density using algorithm 7.8. It then applies equation (7.32) to compute the ratio of normalizing constants between the two distributions. Finally, the algorithm applies equation (7.31) to compute an estimate for the probability of failure.


Example 7.9. Proof that a bridge sampling estimator that sets ḡ₁(τ) to the unnormalized failure density and ḡ₂(τ) to the nominal trajectory distribution can perform no better than the direct estimator for estimating the probability of failure.

To estimate the probability of failure from equation (7.27), we set ḡ₁(τ) equal to the unnormalized failure density and ḡ₂(τ) equal to the density of the nominal trajectory distribution:

    pfail ≈ ((1/m₂) ∑_{j=1}^{m₂} ḡ_b(τⱼ)/p(τⱼ)) / ((1/m₁) ∑_{i=1}^{m₁} ḡ_b(τᵢ)/(1{τᵢ ∉ ψ} p(τᵢ)))

Plugging in to equation (7.28) to get the optimal bridge density gives

    ḡ_b*(τ) ∝ 1{τ ∉ ψ} p(τ)² / (m₁ 1{τ ∉ ψ} p(τ) + m₂ pfail p(τ)) ∝ p(τ) if τ ∉ ψ, and 0 otherwise

The optimal bridge density is zero for all samples that are not failures. Since all the samples from the failure density will be failure samples, we have

    (1/m₁) ∑_{i=1}^{m₁} ḡ_b(τᵢ)/(1{τᵢ ∉ ψ} p(τᵢ)) = (1/m₁) ∑_{i=1}^{m₁} p(τᵢ)/p(τᵢ) = 1

We also have that

    (1/m₂) ∑_{j=1}^{m₂} ḡ_b(τⱼ)/p(τⱼ) = (1/m₂) ∑_{j=1}^{m₂} 1{τⱼ ∉ ψ} = n₂/m₂

where n₂ is the number of samples from the nominal trajectory distribution that were failures. In this case, the bridge sampling estimator reduces to the direct estimator in equation (7.2). Since we produced this result using the optimal bridge density, we can conclude that this estimator will not perform any better than the direct estimator for estimating the probability of failure.


Algorithm 7.10. The bridge sampling algorithm for estimating the probability of failure. The algorithm takes as input a nominal trajectory distribution p and a sequence of intermediate densities ḡs. At each iteration, the algorithm performs resampling and uses the system-specific perturb function to produce samples from the next distribution. It then estimates the optimal bridge density and applies equation (7.27) from algorithm 7.8 to compute the ratio of normalizing constants. The final iteration draws samples from the failure distribution and produces an estimate of the failure probability.

struct BridgeSamplingEstimation
    p       # nominal trajectory distribution
    ḡs      # intermediate distributions
    perturb # samples′ = perturb(samples, ḡ′)
    m       # number of samples from each intermediate distribution
    kb      # number of iterations for estimating optimal bridge
end

function estimate(alg::BridgeSamplingEstimation, sys, ψ)
    p, ḡs, perturb, m, kb = alg.p, alg.ḡs, alg.perturb, alg.m, alg.kb
    p̄failure(τ) = isfailure(ψ, τ) * pdf(p, τ)
    p̄(τ) = pdf(p, τ)  # density of the nominal trajectory distribution
    τs = [rollout(sys, p) for i in 1:m]
    p̂fail = 1.0
    for (ḡ, ḡ′) in zip([p̄; ḡs...], [ḡs...; p̄failure])
        ws = [ḡ′(τ) / ḡ(τ) for τ in τs]
        τs′ = τs[rand(Categorical(ws ./ sum(ws)), m)]
        τs′ = perturb(τs′, ḡ′)
        ḡb = optimal_bridge(τs′, ḡ′, τs, ḡ, kb)
        ratio = bridge_sampling_estimator(τs′, ḡ′, τs, ḡ, ḡb)
        p̂fail *= ratio
        τs = τs′
    end
    return p̂fail
end

The perturb step in algorithm 7.10 produces samples from the next distribution
in the sequence and can be performed using the MCMC algorithms presented in
chapter 6. However, these distributions may be difficult to sample from, especially
as we get closer to the failure distribution. In practice, we can greatly increase
efficiency by using the samples from the previous distribution as a starting point
to produce samples from the next distribution. This process is similar to the
process used in SMC (algorithm 7.7), in which we weight and resample the
trajectories from the previous distribution before applying MCMC. The ability
to adapt samples from the previous distribution is another benefit of using a
sequence of intermediate distributions. Figure 7.13 shows the samples from the
intermediate distributions for the continuum world problem.

7.6 Multilevel Splitting

Multilevel splitting algorithms estimate the probability of failure using a series of conditional distributions.18 Similar to the cross entropy method (section 7.3.1), multilevel splitting relies on an objective function f with properties such that the

18. H. Kahn and T. E. Harris, “Estimation of Particle Transmission by Random Sampling,” National Bureau of Standards Applied Mathematics Series, vol. 12, pp. 27–30, 1951.

p(t ) e = 0.5 e = 0.2 Figure 7.13. Samples from the


smoothed intermediate distribu-
tions when applying bridge sam-
pling to estimate the probability
of failure for the continuum world
problem. These samples result in
a failure probability estimate of
4.93 ⇥ 10 4 . The direct estimate
from one million samples is 2.47 ⇥
10 4 .

e = 0.1 e = 0.01 / y)
p(t | t 2

probability of failure can be written as p(f(τ) ≤ 0). Given a series of thresholds γ₁ > γ₂ > ··· > γₙ where γ₁ = ∞ and γₙ = 0, we can write the probability of failure as

    pfail = p(f(τ) ≤ 0) = ∏_{ℓ=2}^n p(f(τ) ≤ γ_ℓ | f(τ) ≤ γ_{ℓ−1})        (7.33)

As long as the thresholds gradually decrease in a way that ensures that the conditional probabilities remain large, we can efficiently estimate these intermediate probabilities using direct estimation.

To ensure that the conditional probabilities remain large, it is common to select the thresholds adaptively.19 Algorithm 7.11 implements adaptive multilevel splitting. Adaptive multilevel splitting begins by drawing samples from the nominal trajectory distribution. At each iteration, it computes the objective value for each sample and selects a threshold γ such that a fixed number of samples have objective values less than γ. It then uses this threshold and the current samples to estimate p(f(τ) ≤ γ_ℓ | f(τ) ≤ γ_{ℓ−1}).

19. F. Cérou and A. Guyader, “Adaptive Multilevel Splitting for Rare Event Analysis,” Stochastic Analysis and Applications, vol. 25, no. 2, pp. 417–443, 2007.
to estimate p( f (t )  g` | f (t )  g` 1 ).
The algorithm produces the next set of samples by perturbing the current
samples to represent the distribution p(t | f (t )  g` ). As with SMC and bridge


Algorithm 7.11. The adaptive multilevel splitting algorithm for estimating the probability of failure. At each iteration, the algorithm computes the objective value for each sample and selects the next threshold. It then estimates the current conditional probability and perturbs the samples to represent the next distribution. The perturb function is system specific. The algorithm iterates until the threshold reaches zero. If we reach the maximum number of iterations before this criterion is met, the algorithm will force the final threshold to be zero.

struct AdaptiveMultilevelSplitting
    p       # nominal trajectory distribution
    m       # number of samples
    m_elite # number of elite samples
    k_max   # maximum number of iterations
    f       # objective function f(τ, ψ)
    perturb # τs′ = perturb(τs, p̄γ)
end

function estimate(alg::AdaptiveMultilevelSplitting, sys, ψ)
    p, m, m_elite, k_max = alg.p, alg.m, alg.m_elite, alg.k_max
    f, perturb = alg.f, alg.perturb
    τs = [rollout(sys, p) for i in 1:m]
    p̂fail = 1.0
    for i in 1:k_max
        Y = [f(τ, ψ) for τ in τs]
        order = sortperm(Y)
        γ = i == k_max ? 0 : max(0, Y[order[m_elite]])
        p̂fail *= mean(Y .≤ γ)
        γ == 0 && break
        τs = rand(τs[order[1:m_elite]], m)
        p̄γ(τ) = pdf(p, τ) * (f(τ, ψ) ≤ γ)
        τs = perturb(τs, p̄γ)
    end
    return p̂fail
end


Figure 7.14. Adaptive multilevel splitting applied to the inverted pendulum system, shown at iterations 1, 4, 8, and 12. The shaded blue region represents the region where the objective function is less than the current threshold. The samples gradually transition from the nominal trajectory distribution to the failure distribution as the algorithm progresses. The algorithm terminates on iteration 12 when all elite samples are failures.

sampling, this step can be performed using the MCMC algorithms presented in
chapter 6. To improve the efficiency of the MCMC, we first resample by drawing
m samples uniformly from the elite samples.
To accurately estimate the probability of failure, the last iteration of the algo-
rithm must use a threshold of zero. Algorithm 7.11 iterates until the threshold
reaches zero, at which point all elite samples are failures. If we reach the maxi-
mum number of iterations before this criterion is met, the algorithm will force
the final threshold to be zero. However, if there are no failure samples in the final
iteration, the final conditional probability will be zero, causing the algorithm
to return an estimate of zero. Therefore, it is important to ensure that we allow
enough iterations for the algorithm to reach the final threshold.
Multilevel splitting is considered a nonparametric algorithm in that we estimate
the probability of failure without assuming a specific form for the conditional
distributions. This feature allows multilevel splitting to extend to systems with
complex, multimodal failure distributions. Furthermore, the adaptive nature
of algorithm 7.11 allows us to smoothly transition from the nominal trajectory
distribution to the failure distribution without specifying the intermediate dis-
tributions ahead of time. Figure 7.14 shows an example of adaptive multilevel
splitting applied to the inverted pendulum system, which has two modes in its
failure distribution.

7.7 Summary

• We can view the problem of estimating the probability of failure as a parameter


estimation problem and apply maximum likelihood or Bayesian methods.


• For systems with rare failure events, we can use importance sampling to estimate the probability of failure by sampling from a proposal distribution that assigns higher likelihood to failure trajectories.

• The performance of importance sampling algorithms depends on the choice


of proposal distribution, and we want to select a proposal distribution that is
as close as possible to the optimal proposal distribution, which is the failure
distribution.

• We use adaptive importance sampling algorithms to automatically tune a


proposal or set of proposals based on samples.

• Sequential Monte Carlo is a nonparametric algorithm that can be applied to es-


timate the probability of failure for complex, multimodal failure distributions.

• By viewing the failure probability estimation problem as a special case of a


more general problem of estimating ratios of normalizing constants, we can
derive estimators that allow us to use complex proposal distributions for which
we only know the unnormalized density.

• Umbrella sampling and bridge sampling increase efficiency by introducing a


third density into the ratio of normalizing constants.

• Multilevel splitting is a nonparametric algorithm that estimates the probability


of failure using a series of conditional distributions.

7.8 Exercises
Exercise 7.1. The coefficient of variation of a random variable is defined as the ratio of
the standard deviation to the mean and is a measure of relative variability. Compute the
coefficient of variation for the estimator in equation (7.2). For a fixed sample size m, how
does the coefficient of variation change as pfail increases? For a fixed pfail , how does the
coefficient of variation change as the sample size m increases?

Solution: The coefficient of variation for the estimator in equation (7.2) is

    standard deviation / mean = √(pfail(1 − pfail)/m) / pfail = √((1 − pfail)/(m pfail))


For a fixed m, the coefficient of variation will decrease as the true probability of failure
pfail increases. For a fixed pfail , the coefficient of variation will decrease as the sample size
m increases. The plots here show an example of these relationships.

(Plots: coefficient of variation versus pfail for m = 100, and versus m for pfail = 0.01.)

Exercise 7.2. Show that equation (7.24) reduces to equation (7.2) when q̄₁(τ) = 1{τ ∉ ψ} p(τ) and q̄₂(τ) = p(τ).

Solution: Since q̄₁ is the unnormalized failure distribution, its normalizing constant is the probability of failure (z₁ = pfail). Since q̄₂ is the normalized nominal distribution, its normalizing constant is z₂ = 1. Plugging these values into equation (7.24) gives

    pfail / 1 = E_{τ∼p(·)}[1{τ ∉ ψ} p(τ)/p(τ)]
    pfail = E_{τ∼p(·)}[1{τ ∉ ψ}]

Given samples from the nominal distribution τᵢ ∼ p(·), we can approximate the above equation as

    p̂fail = (1/m) ∑_{i=1}^m 1{τᵢ ∉ ψ}

Exercise 7.3. Show that equation (7.24) reduces to equation (7.8) when q̄₁(τ) = 1{τ ∉ ψ} p(τ) and q̄₂(τ) = q(τ).

Solution: Since q̄₁ is the unnormalized failure distribution, its normalizing constant is the probability of failure (z₁ = pfail). Since q̄₂ is a normalized proposal distribution, its normalizing constant is z₂ = 1. Plugging these values into equation (7.24) gives

    pfail / 1 = E_{τ∼q(·)}[1{τ ∉ ψ} p(τ)/q(τ)]
    pfail = E_{τ∼q(·)}[1{τ ∉ ψ} p(τ)/q(τ)]

Given samples from the proposal distribution τᵢ ∼ q(·), we can approximate the above equation as

    p̂fail = (1/m) ∑_{i=1}^m 1{τᵢ ∉ ψ} p(τᵢ)/q(τᵢ)

8 Reachability for Linear Systems

In chapters 4 to 7, we covered a variety of sampling-based validation algorithms.


We now transition to formal methods, which can provide mathematical guaran-
tees that a system satisfies a given specification. In contrast with sampling-based
algorithms, which evaluate properties based on a finite sampling of trajectories,
formal methods consider the full set of possible trajectories. We first focus on the
task of reachability. In this chapter, we discuss algorithms for forward reachability
of linear systems. Forward reachability algorithms start with a set of initial states
and compute the set of states that the system reaches as it progresses forward
in time. This chapter begins by defining linear systems and the corresponding
reachability problem. We then discuss set propagation techniques for computing
reachable sets. Set propagation techniques can be computationally expensive for
high-dimensional systems with long time horizons, so we also discuss overap-
proximation techniques that allow us to reduce the computational complexity.
Finally, we outline an optimization-based approach to reachability analysis.

8.1 Forward Reachability

Forward reachability algorithms compute the set of states a system could reach over a given time horizon. To perform this analysis, we need to make some assumptions about the initial state and disturbances for the system. In the previous chapters, we sampled initial states and disturbances from probability distributions, often with support over the entire real line. However, to perform reachability computations, we need to restrict the initial states and disturbances to bounded sets.1 We assume that the initial state comes from a bounded set 𝒮 and that the disturbances at each time step come from a bounded set 𝒳.

1. One way to convert a probability distribution to a bounded set is to use the support of the distribution. If the support of the distribution spans the entire real line, we can select a region that contains most of the probability mass.

The disturbance set 𝒳 is defined as follows:

    𝒳 = { [xₐ; xₒ; xₛ] | xₐ ∈ 𝒳ₐ, xₒ ∈ 𝒳ₒ, xₛ ∈ 𝒳ₛ }        (8.1)

where 𝒳ₐ, 𝒳ₛ, and 𝒳ₒ are the disturbance sets for the agent, environment, and sensor, respectively.
Given an initial state s and a disturbance trajectory x₁:d = (x₁, …, x_d) with depth d, we can compute the state of the system at time step d by performing a rollout (algorithm 4.6) and taking the final state. We denote this operation as s_d = Reach(s, x₁:d). By performing this operation on various initial states and disturbances sampled from 𝒮 and 𝒳, we find a set of points in the state space that the system could reach at time step d. Figure 8.1 demonstrates this process on the mass-spring-damper system.

Figure 8.1. Samples from ℛ₅ for the mass-spring-damper system with initial position between −0.2 and 0.2 and initial velocity set to zero. The disturbance sets for the observation noise are bounded between −1 and 1. The gray points represent the initial states, the gray lines show the trajectories, and the blue points represent the states after 5 time steps.

We define the reachable set at depth d as the set of all states that the system could reach at time step d given all possible initial states and disturbances. We write this set as

    ℛ_d = {s_d | s_d = Reach(s, x₁:d), s ∈ 𝒮, x_t ∈ 𝒳_t, t ∈ 1:d}        (8.2)

where 𝒳_t represents the set of possible disturbances at time step t. We are often interested in the full set of states that the system might reach in a given time horizon rather than at a specific depth d. We denote this set as ℛ₁:h and represent it as the union of the reachable sets at each depth up to the time horizon:

    ℛ₁:h = ⋃_{d=1}^h ℛ_d        (8.3)
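To build intuition before turning to set-based methods, here is a minimal sketch of approximating ℛ_d by sampling. The rollout signature and the assumption that rand draws points from 𝒮 and 𝒳 are hypothetical, not from the text:

function sample_reachable_points(sys, 𝒮, 𝒳, d; m=1000)
    points = []
    for i in 1:m
        s = rand(𝒮)                 # sample an initial state from 𝒮
        x = [rand(𝒳) for t in 1:d]  # sample a disturbance trajectory from 𝒳
        τ = rollout(sys, s, x)       # simulate the system forward for d steps
        push!(points, τ[end].s)      # keep only the state at depth d
    end
    return points
end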

Figure 8.2 shows the reachable sets in ℛ₁:₄ for the mass-spring-damper system.

Computing reachable sets allows us to understand the behavior of a system over time. For example, we can use reachable sets to determine if a system remains within a safe region of the state space.2 We call the set of states that make up the unsafe region of the state space the avoid set and use this set to define a specification for the system (algorithm 8.1). If the reachable set intersects with the avoid set, the system violates the specification.

2. We could also determine if the system reaches a goal region in the state space. In this case, we would want to check if the reachable set is contained within the goal region.

The reachability algorithms we discuss in this chapter apply to linear systems. Linear systems are a class of systems for which the sensor, agent, and environment


Figure 8.2. The reachable sets ℛ₁ = 𝒮, ℛ₂, ℛ₃, and ℛ₄ that make up ℛ₁:₄ for the mass-spring-damper system. The blue points are samples from the reachable set generated using the Reach operator. The shaded blue regions show the true reachable sets, and the red regions make up the avoid set. Since the reachable sets do not intersect with the avoid set, the system satisfies the specification.

Algorithm 8.1. A specification that checks if a trajectory avoids a given set. The set can be any type that supports the ∉ operator. A common package for defining sets in Julia is LazySets.jl. M. Forets and C. Schilling, “LazySets.jl: Scalable Symbolic-Numeric Set Computations,” Proceedings of the JuliaCon Conferences, vol. 1, no. 1, pp. 1–11, 2021.

struct AvoidSetSpecification <: Specification
    set # avoid set
end
evaluate(ψ::AvoidSetSpecification, τ) = all(step.s ∉ ψ.set for step in τ)

models are linear functions of the continuous state s, action a, observation o, and disturbance x. We define these models mathematically as follows:

    O(s, xₒ) = O_s s + xₒ                (8.4)
    π(o, xₐ) = Π_o o + xₐ                (8.5)
    T(s, a, xₛ) = T_s s + T_a a + xₛ     (8.6)

where O_s, Π_o, T_s, and T_a are matrices of the appropriate dimensions.3 Example 8.1 outlines a common linear system that we will refer to throughout this chapter.

3. We could also multiply the disturbances by matrices, but we omit this step for simplicity.
A naïve approach to computing reachable sets would involve applying the
Reach operator to all possible initial states and disturbances. However, this ap-
proach is not feasible for systems with continuous states and disturbances since
there are an infinite number of possibilities. Instead, we rely on other mathemati-
cal analysis techniques to reason about the reachable set. The remainder of the
chapter discusses set propagation and optimization techniques that can be used
to compute reachable sets for linear systems.

8.2 Set Propagation Techniques

Set propagation techniques perform reachability by converting the operations in


equations (8.4) to (8.6) to set operations. To introduce these techniques, we will


Example 8.1. A common example of a linear system is a mass-spring-damper system, which can be used to model mechanical systems such as a car suspension or a bridge. The system consists of a mass m attached to a wall by a spring with spring constant k and a damper with damping coefficient c, and it is controlled by a force b applied to the mass. The state of the system is the position (relative to the resting point) p and velocity v of the mass (s = [p, v]), the action is the force b applied to the mass, and the observation is a noisy measurement of the state. The equations of motion for a mass-spring-damper system are

    p′ = p + v Δt
    v′ = v + (−(k/m) p − (c/m) v + (1/m) b) Δt

where m is the mass, k is the spring constant, c is the damping coefficient, and Δt is the discrete time step. Rewriting the dynamics in the form of equation (8.6), we have

    T(s, a, xₛ) = [1 Δt; −(k/m)Δt 1−(c/m)Δt] [p; v] + [0; (1/m)Δt] b + xₛ = T_s s + T_a a + xₛ

We control the mass-spring-damper using a proportional controller such that Π_o in equation (8.5) is the gain matrix. Similarly, we model the sensor as an additive noise sensor such that O_s in equation (8.4) is the identity matrix and xₒ is the additive noise distributed uniformly within specified bounds. Typically, trajectories for this system oscillate back and forth before coming to rest. In general, we want to ensure that the system remains stable, meaning that the position does not exceed some magnitude. The plots below show simulated trajectories of the system for different levels of observation noise (−0.1 ≤ xₒ ≤ 0.1, −1 ≤ xₒ ≤ 1, and −10 ≤ xₒ ≤ 10). With enough noise, the system becomes unstable.

(Plots: position p over time for the three noise levels.)
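A minimal sketch of these dynamics in Julia follows; the parameter values are illustrative assumptions, not from the text:

m, k, c, Δt = 1.0, 1.0, 0.1, 0.05        # hypothetical parameter values
Ts = [1.0 Δt; -(k/m)*Δt 1-(c/m)*Δt]      # state transition matrix
Ta = [0.0, Δt/m]                         # action vector
next_state(s, b, xs) = Ts*s + Ta*b + xs  # one step of equation (8.6)

# for example, one step from s = [0.1, 0.0] with force b = 1 and no disturbance:
# next_state([0.1, 0.0], 1.0, [0.0, 0.0])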


Figure 8.3. Visualization of linear set operations (a linear transformation and a Minkowski sum). Applying a linear transformation has the effect of rotating, stretching, and translating the set. The Minkowski sum of two sets is the set of all points obtained by adding each point in the first set to each point in the second set.

first focus on set propagation for the one step reachability problem. We assume we are given a set of initial states 𝒮 and a set of disturbances 𝒳. Our goal is to compute the set of states 𝒮′ that the system could reach at the next time step. Given a single initial state s and disturbance x, we can compute the next state s′ by applying equations (8.4) to (8.6) sequentially:

    s′ = T(s, π(O(s, xₒ), xₐ), xₛ)                          (8.7)
       = T_s s + T_a(Π_o(O_s s + xₒ) + xₐ) + xₛ              (8.8)
       = (T_s + T_a Π_o O_s)s + T_a Π_o xₒ + T_a xₐ + xₛ     (8.9)

We can compute the reachable set at the next time step by applying equation (8.9) to 𝒮 and 𝒳. To perform this computation, we must define the operations in equation (8.9) as set operations. In particular, we must be able to apply a linear transformation, or matrix multiplication, to a set and add two sets together.

The multiplication of a set 𝒫 by a matrix A is defined as

    A𝒫 = {Ap | p ∈ 𝒫}        (8.10)

where the result is the set of all points obtained by multiplying each point in 𝒫 by A. The addition of two sets 𝒫 and 𝒬 is defined as

    𝒫 ⊕ 𝒬 = {p + q | p ∈ 𝒫, q ∈ 𝒬}        (8.11)


where the result is the set of all points obtained by adding each point in 𝒫 to each point in 𝒬. This operation is referred to as the Minkowski sum of two sets and is often denoted using the ⊕ symbol.4 Figure 8.3 shows these operations in two-dimensional space. As we will discuss in the next section, we can efficiently compute linear transformations and Minkowski sums for many common set types such as hyperrectangles and polytopes.5

4. The Minkowski sum is named after Polish mathematician Hermann Minkowski (1864–1909).

5. The LazySets.jl package in Julia provides implementations of these operations for many common sets.

With these definitions in place, we can rewrite equation (8.9) using set operations as

    𝒮′ = (T_s + T_a Π_o O_s)𝒮 ⊕ T_a Π_o 𝒳ₒ ⊕ T_a 𝒳ₐ ⊕ 𝒳ₛ        (8.12)

where 𝒮′ is the one step reachable set. It is important that we simplify the system dynamics into the form of equation (8.9) before applying set operations. If we apply the equations without simplification, we may encounter a phenomenon called the dependency effect, which occurs when a variable appears more than once in a formula. Set operations fail to model this dependency, leading to conservative reachable sets (see example 8.2). Algorithm 8.2 implements equation (8.12).
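Before looking at algorithm 8.2, here is a minimal sketch of the two set operations using LazySets.jl; the specific sets and matrix are illustrative assumptions:

using LazySets

P = Hyperrectangle(low=[-1.0, -1.0], high=[1.0, 1.0])
Q = Hyperrectangle(low=[-0.1, -0.1], high=[0.1, 0.1])
A = [1.0 2.0; 0.5 1.0]

AP = A * P    # linear transformation of a set, equation (8.10)
S′ = AP ⊕ Q   # Minkowski sum of two sets, equation (8.11)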

Algorithm 8.2. Algorithm for computing the one step reachable set for a linear system with initial states from 𝒮 and disturbances from 𝒳. We use the LazySets.jl package to perform the set operations in equation (8.12).

function get_matrices(sys)
    return Ts(sys.env), Ta(sys.env), Πo(sys.agent), Os(sys.sensor)
end

function linear_set_propagation(sys, 𝒮, 𝒳)
    Ts, Ta, Πo, Os = get_matrices(sys)
    return (Ts + Ta * Πo * Os) * 𝒮 ⊕ Ta * Πo * 𝒳.xo ⊕ Ta * 𝒳.xa ⊕ 𝒳.xs
end

To compute reachable sets over a given time horizon using set propagation techniques, we rely on the fact that the reachable set at time step d is a function of the reachable set at time step d − 1. Specifically, we can compute the reachable set at time step d by applying equation (8.12) to the reachable set at time step d − 1:

    ℛ_d = (T_s + T_a Π_o O_s)ℛ_{d−1} ⊕ T_a Π_o 𝒳ₒ ⊕ T_a 𝒳ₐ ⊕ 𝒳ₛ        (8.13)

Algorithm 8.3 implements this recursive algorithm for computing the reachable set at each time step. The algorithm terminates when it reaches the desired time horizon h and returns ℛ₁:h.
In addition to gaining insight into the behavior of a system, we can use reach-
able sets to verify that a system satisfies a given specification. For a given spec-
ification, we want to ensure that the reachable set does not intersect with its


Example 8.2. Example of the dependency effect on a simple system. Consider a simple system with the following component models:

    O(s, xₒ) = s
    π(o, xₐ) = I o
    T(s, a, xₛ) = s − a

where the state, action, and observation are two-dimensional and I is the identity matrix. Suppose we want to compute the one-step reachable set 𝒮′ when the initial set is a square centered at the origin with side length 1. If we apply the sensor, agent, and environment models on the initial set without simplification, we get 𝒪 = 𝒮, 𝒜 = I𝒪 = I𝒮, and 𝒮′ = 𝒮 ⊕ (−𝒜) = 𝒮 ⊕ (−I𝒮). The resulting set 𝒮 ⊕ (−I𝒮) is a square with side length 2 centered at the origin. However, if we first simplify before switching to set operations, we get that s′ = s − s = 0. Thus, the true reachable set contains only the origin. The plots below show this result.

(Plots: the initial set 𝒮, the set 𝒮′ computed without simplification, and the true 𝒮′.)

This mismatch is due to an effect called the dependency effect, which leads to conservative reachable sets. Because applying the set operations in order does not account for the fact that the action depends on the state, it considers worst-case behavior. For this reason, it is important to simplify the system models into the form of equation (8.9) before applying set operations to avoid unnecessary conservativeness. While this simplification is always possible for linear systems, it is not always possible for the nonlinear systems we discuss in the next chapter.


Algorithm 8.3. Linear forward reachability using set propagation. The 𝒮₁ and disturbance_set functions are system-specific functions that return the initial state set and disturbance set, respectively. We assume the disturbance set is the same at each time step. At each iteration, the algorithm computes the reachable set at the next time step by calling algorithm 8.2. It then adds this set to the union of reachable sets. The algorithm terminates when it reaches the desired time horizon h and returns ℛ₁:h.

abstract type ReachabilityAlgorithm end

struct SetPropagation <: ReachabilityAlgorithm
    h # time horizon
end

function reachable(alg::SetPropagation, sys)
    h = alg.h
    𝒮, 𝒳 = 𝒮₁(sys.env), disturbance_set(sys)
    ℛ = 𝒮
    for t in 1:h
        𝒮 = linear_set_propagation(sys, 𝒮, 𝒳)
        ℛ = ℛ ∪ 𝒮
    end
    return ℛ
end

Algorithm 8.4. Checking whether a system satisfies a given specification using set propagation. The algorithm computes the reachable set ℛ₁:h using algorithm 8.3 and checks whether its intersection with the avoid set ¬ψ is empty.

¬(ψ::AvoidSetSpecification) = ψ.set
function satisfies(alg::SetPropagation, sys, ψ)
    ℛ = reachable(alg, sys)
    return isempty(ℛ ∩ ¬ψ)  # satisfied when the intersection is empty, equation (8.14)
end

Example 8.3. Computing the reachable sets for the mass-spring-damper system over a time horizon of 20 steps. Suppose we want to compute ℛ₁:₂₀ for the mass-spring-damper system with initial position between −0.2 and 0.2 and initial velocity set to zero. We assume the observation noise is bounded between −0.2 and 0.2. To perform reachability, we must implement the following functions for the system:

𝒮₁(env::MassSpringDamper) = Hyperrectangle(low=[-0.2,0], high=[0.2,0])
function disturbance_set(sys)
    Do = sys.sensor.Do
    low = [support(d).lb for d in Do.v]
    high = [support(d).ub for d in Do.v]
    return Disturbance(ZeroSet(1), ZeroSet(2), Hyperrectangle(;low,high))
end

The disturbance_set function uses the support of the disturbance distribution from the sensor to define the observation disturbance set. We use ZeroSet from LazySets.jl for the agent and environment disturbances since they are deterministic. We can then compute the reachable set using algorithm 8.3 and visualize the reachable sets in ℛ₁:₂₀, which switch from light blue to dark blue over time in the accompanying plot.


Figure 8.4. Reachable sets (bottom row) for the mass-spring-damper system with varying levels of observation noise (−0.1 ≤ xₒ ≤ 0.1, −1.0 ≤ xₒ ≤ 1.0, and −2.5 ≤ xₒ ≤ 2.5) compared to samples from a finite number of trajectory rollouts (top row). As the noise bounds increase, the reachable sets move closer to the avoid set. For the largest noise bound, the reachable set intersects with the avoid set, indicating that the system violates the specification. However, the finite number of trajectory samples do not capture this behavior. Formal methods such as reachability are able to identify this violation by considering the entire reachable set.

complement such that

    R1:h ∩ ¬ψ = ∅    (8.14)

For a specification that is defined as an avoid set, we can check whether the system satisfies the specification by verifying that the reachable set does not intersect with the avoid set (algorithm 8.4). Figure 8.4 shows the reachable sets for the mass-spring-damper system with varying levels of observation noise compared to a finite sampling of reachable points. When the noise becomes large enough, the reachable sets intersect with the avoid set, indicating that the system violates the specification.
In general, the safety guarantee derived from equation (8.14) only holds up to the horizon h. In other words, there is no guarantee that the system will not enter the avoid set after the time horizon. However, if we observe certain convergence properties of the reachable set, we can extend the safety guarantee to infinite time. Specifically, if at any point in algorithm 8.3 we find that Rd ⊆ Rd−1 (figure 8.5), we can conclude that Rd is an invariant set, meaning that the system will stay within this set indefinitely.⁶

Figure 8.5. Example of an invariant set Rd such that Rd ⊆ Rd−1.

⁶ We can use LazySets.jl to check this property for convex sets. It is also the case that if Rd ⊆ R1:d−1, then R1:d is an invariant set. However, this property is generally more difficult to check since it requires checking whether a set is a subset of a nonconvex set.

Figure 8.6. An example of a convex and a nonconvex set (panels: a convex set, a nonconvex set). For the nonconvex set, it is possible to draw a line segment connecting two points in the set that is not entirely contained within the set.

8.3 Set Representations

To ensure that algorithms 8.2 to 8.4 are tractable, we must select set representations
that are computationally efficient. Desirable properties include:

• Finite representations: We should be able to specify the points that are contained
in the set without needing to enumerate all of them.

• Efficient set operations: We should be able to perform set operations such as linear transformations, Minkowski sums, and intersection efficiently.

• Closure under set operations: A set representation is closed under a particular set
operation if applying the operation results in a set of the same type.

In this chapter, we will focus on convex set representations, which tend to have these properties.⁷ A convex set is a set for which the line segment between any two points in the set is contained entirely within the set. Mathematically, a set P is convex if we have

    αp + (1 − α)q ∈ P    (8.15)

for all p, q ∈ P and α ∈ [0, 1]. Figure 8.6 illustrates this property. The rest of this section discusses a common convex set representation called polytopes.

⁷ Some nonconvex sets can also be efficiently represented and manipulated. A detailed overview is provided in M. Althoff, G. Frehse, and A. Girard, "Set Propagation Techniques for Reachability Analysis," Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 369–395, 2021.
pp. 369–395, 2021.

8.3.1 Polytopes

A polytope is defined as the bounded intersection of a set of linear inequalities.⁸ A linear inequality has the form a⊤x ≤ b where a is a vector of coefficients, x is a vector of variables, and b is a scalar. We refer to the set of points that satisfy a given linear inequality as a half space. A polyhedron is the intersection of a finite number of half spaces. If the polyhedron is bounded, we call it a polytope. Figure 8.7 illustrates these concepts in two dimensions.

⁸ We can also define convex sets such as ellipsoids using nonlinear inequalities. O. Maler, "Computing Reachable Sets: An Introduction," French National Center of Scientific Research, pp. 1–8, 2008.


Figure 8.7. Example of a half space, polyhedron, and polytope in two dimensions (panels: Half Space, Polyhedron, Polytope). A half space is defined by a single linear inequality a₁⊤x ≤ b₁, a polyhedron is the intersection of multiple half spaces, and a polytope is a bounded polyhedron Ax ≤ b.

We can represent polytopes in different ways. An H-polytope is a polytope represented as a set of half spaces. It is written in the form Ax ≤ b where A and b are formed by stacking the linear inequalities from the half spaces. A V-polytope is a polytope represented as the convex hull of a set of vertices V, written as conv(V). The convex hull of a set of points V is the set of all possible convex combinations of the points. A convex combination of a set of points {v1, . . . , vn} is a linear combination of the form

    λ1 v1 + · · · + λn vn    (8.16)

such that ∑ᵢ₌₁ⁿ λᵢ = 1 and λᵢ ≥ 0 for all i. Intuitively, the convex hull of a set of points is the smallest convex set that contains all the points (figure 8.8).

Figure 8.8. The convex hull of a set of points.
It is always possible to convert between the two polytope representations; however, the calculation is nontrivial.⁹ Each representation has different advantages. For example, H-polytopes are more efficient for checking whether a point belongs to the set because we can simply check if it satisfies all the linear inequalities. In contrast, V-polytopes are more efficient for set operations such as linear transformations. To compute a linear transformation of a polytope represented as a V-polytope, we can apply the transformation to each vertex to obtain the vertices of the transformed polytope.

⁹ A detailed overview is provided in G. M. Ziegler, Lectures on Polytopes. Springer Science & Business Media, 2012, vol. 152. In Julia, LazySets.jl provides functionality to convert between the two representations.
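As a small illustration of these trade-offs, the sketch below (a minimal example with made-up sets, using the LazySets.jl types that appear throughout this chapter) checks point membership with an H-polytope and linearly transforms a V-polytope:

using LazySets

# H-polytope for the unit square: each HalfSpace encodes a⊤x ≤ b
P_h = HPolytope([HalfSpace([1.0, 0.0], 1.0), HalfSpace([-1.0, 0.0], 1.0),
                 HalfSpace([0.0, 1.0], 1.0), HalfSpace([0.0, -1.0], 1.0)])
[0.5, -0.5] ∈ P_h   # membership check: evaluates all linear inequalities

# V-polytope for the same square; a linear map only touches the vertices
P_v = VPolytope([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
A = [1.0 0.5; 0.0 1.0]  # shear transformation
linear_map(A, P_v)      # applies A to each vertex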
The Minkowski sum of two V-polytopes is

    P1 ⊕ P2 = conv({v1 + v2 | v1 ∈ V1, v2 ∈ V2})    (8.17)

where V1 and V2 are the vertices of P1 and P2, respectively. In other words, we can obtain all candidates for the vertices of the Minkowski sum by taking the sum

of all pairs of vertices from the two polytopes. To determine which candidates are actual vertices, we must determine which candidate vertices are on the boundary of the convex hull. Figure 8.9 illustrates this process.

Figure 8.9. Computing the Minkowski sum of two V-polytopes P1 and P2.
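The following sketch mirrors equation (8.17) directly (a minimal example with two small assumed triangles, using LazySets.jl):

using LazySets

P1 = VPolytope([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
P2 = VPolytope([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]])

# candidate vertices: all pairwise sums of vertices (eq. 8.17)
candidates = [v1 + v2 for v1 in vertices_list(P1), v2 in vertices_list(P2)]

# the Minkowski sum is the convex hull of the candidates; taking the
# convex hull prunes candidates that are not on the boundary
P = VPolytope(convex_hull(vec(candidates)))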
We can use these results to reason about the complexity of algorithm 8.3 if we were to represent our sets as polytopes. We apply equation (8.12) at each iteration, which involves four linear transformations and three Minkowski sums. The number of candidate vertices resulting from computing the one-step reachable set using equation (8.12) is |S1||Xo||Xa||Xs|, where |P| represents the number of vertices in polytope P. The number of candidate vertices for the reachable set at depth d is then |S1|(|Xo||Xa||Xs|)ᵈ. We can prune the candidate vertices that are not actual vertices by computing the convex hull of the candidate vertices, but this operation can be expensive.¹⁰ Therefore, the exponential growth in the number of candidate vertices creates tractability challenges for high-dimensional systems with long time horizons.¹¹

¹⁰ The most efficient algorithms for computing the vertices of the convex hull of a set of points have a complexity of O(mv), where m is the number of candidate vertices and v is the number of actual vertices. In general, the number of actual vertices grows superlinearly. For more details, see R. Seidel, "Convex Hull Computations," in Handbook of Discrete and Computational Geometry, Chapman and Hall, 2017, pp. 687–703.
¹¹ Other polytope representations such as the Z-representation and M-representation perform Minkowski sums more efficiently. More details are provided in S. Sigl and M. Althoff, "M-Representation of Polytopes," ArXiv:2303.05173, 2023.

8.3.2 Zonotopes

A zonotope is a special type of polytope that avoids the exponential growth in candidate vertices for Minkowski sums. It is defined as the Minkowski sum of a set of line segments centered at a point c:

    Z = {c + ∑ᵢ₌₁ᵐ αᵢ gᵢ | αᵢ ∈ [−1, 1]}    (8.18)

where g1:m are referred to as the generators of the zonotope.¹² We represent zonotopes by a center point and a list of generators:

    Z = (c, ⟨g1:m⟩)    (8.19)

¹² Zonotopes can also be viewed as linear transformations of the unit hypercube.


Figure 8.10. Iterative construction of a zonotope centered at a point c by taking the Minkowski sum of its generators.

Figure 8.10 shows the construction of a zonotope from its generators.

To apply a linear transformation to a zonotope, we apply the transformation to the center and each generator:

    AZ = (Ac, ⟨Ag1:m⟩)    (8.20)

To compute the Minkowski sum of two zonotopes, we sum the centers and concatenate the generators:

    Z ⊕ Z′ = (c + c′, ⟨g1:m, g′1:m′⟩)    (8.21)

Note that the number of generators in the resulting zonotope grows linearly with the number of generators in each zonotope. Therefore, if we represent our sets as zonotopes, the number of generators for the reachable set at depth d is |S1| + d|Xo||Xa||Xs|. This linear growth represents a significant improvement over the exponential growth in candidate vertices for generic polytopes.
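The sketch below (a minimal example with assumed centers and generators) carries out both operations with LazySets.jl; note how the generator count adds under the Minkowski sum rather than multiplying:

using LazySets

# zonotopes are stored as a center and a generator matrix (one column each)
Z1 = Zonotope([0.0, 0.0], [1.0 0.0; 0.0 1.0])
Z2 = Zonotope([1.0, 1.0], hcat([0.5, 0.5]))

A = [0.0 -1.0; 1.0 0.0]      # 90-degree rotation
Zl = linear_map(A, Z1)       # maps the center and each generator (eq. 8.20)

Zs = minkowski_sum(Z1, Z2)   # sums centers, concatenates generators (eq. 8.21)
ngens(Zs)                    # 3 = 2 + 1 generators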

8.3.3 Hyperrectangles

A hyperrectangle is a generalization of a rectangle to higher dimensions (figure 8.12). It is a special type of zonotope in which the generators are aligned with the axes. We may also work with linear transformations of hyperrectangles, which can always be transformed back to an axis-aligned representation. All hyperrectangles are zonotopes, and all zonotopes are polytopes; however, the reverse does not hold (figure 8.11). Hyperrectangles can be compactly represented as a center point and a vector of half-widths. They can also be represented as a set of intervals with one for each dimension. Unlike zonotopes, hyperrectangles are not closed under linear transformations and Minkowski sums.

Figure 8.11. Zonotopes are a subclass of polytopes, and hyperrectangles are a subclass of zonotopes (nested sets: Polytopes ⊃ Zonotopes ⊃ Hyperrectangles).
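As a short sketch (assuming LazySets.jl conventions), a hyperrectangle can be built from its center and half-widths or from per-dimension bounds, and viewed as a zonotope with axis-aligned generators:

using LazySets

H1 = Hyperrectangle([0.0, 0.0], [1.0, 0.5])            # center and half-widths
H2 = Hyperrectangle(low=[-1.0, -0.5], high=[1.0, 0.5]) # the same set from bounds

Z = convert(Zonotope, H1)   # axis-aligned generators scaled by the half-widths
genmat(Z)                   # 2×2 diagonal generator matrix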


Figure 8.12. Example of a hyperrectangle in one, two, and three dimensions. In one dimension, a hyperrectangle is equivalent to an interval.

8.4 Reducing Computational Cost

As noted in section 8.3.1, the number of candidate vertices for the reachable sets in algorithm 8.3 grows exponentially with the time horizon and causes computational challenges for high-dimensional systems. There are multiple ways to reduce this computational burden. One way is to represent the initial state and disturbance sets using zonotopes since the number of generators scales linearly with the time horizon (see section 8.3.2). In this section, we will discuss another technique to reduce the computational cost that relies on overapproximation.
The set P̃ represents an overapproximation of the set P if P ⊆ P̃. Typically, we select the overapproximated set P̃ such that it is easier to compute or represent. For example, we can use overapproximation to reduce the computational cost of algorithm 8.3 by overapproximating the reachable set at each iteration with a set that has fewer vertices (figure 8.13). We can then use this overapproximated set as the initial set for the next iteration.
As long as the overapproximated reachable set does not intersect with the avoid set, we can still use it to make claims about the safety of the system. However, if the overapproximated reachable set does intersect with the avoid set, the results are inconclusive. The violation could be due to unsafe behavior or the overapproximation itself. In this case, we could move to a tighter overapproximation or use a different method to verify safety (example 8.4).

Figure 8.13. Overapproximating the blue polytope with the purple polytope. The purple polytope has fewer vertices.
use a different method to verify safety (example 8.4).
Algorithm 8.5 modifies algorithm 8.3 to include overapproximation. Depending on the complexity of the reachable sets, we may not need to overapproximate at every iteration, so we set a frequency parameter to control how often we overapproximate. Figure 8.14 demonstrates this idea on the mass-spring-damper system. A more frequent overapproximation will result in greater computational efficiency at the cost of extra overapproximation error in the reachable sets. We define overapproximation error as the difference in volume between the overapproximated reachable set and the true reachable set.
The overapproximation tolerance ε places a bound on the Hausdorff distance between the overapproximated set and the original set.¹³ The Hausdorff distance

¹³ The Hausdorff distance is named after German mathematician Felix Hausdorff (1868–1942).


struct OverapproximateSetPropagation <: ReachabilityAlgorithm
    h    # time horizon
    freq # overapproximation frequency
    ϵ    # overapproximation tolerance
end

function reachable(alg::OverapproximateSetPropagation, sys)
    h, freq, ϵ = alg.h, alg.freq, alg.ϵ
    𝒮, 𝒳 = 𝒮₁(sys.env), disturbance_set(sys)
    ℛ = 𝒮
    for t in 1:h
        𝒮 = linear_set_propagation(sys, 𝒮, 𝒳)
        ℛ = ℛ ∪ 𝒮
        𝒮 = t % freq == 0 ? overapproximate(𝒮, ϵ) : 𝒮
    end
    return ℛ
end

Algorithm 8.5. Overapproximate linear forward reachability using set propagation. At each iteration, the algorithm calls algorithm 8.2 to compute the reachable set at the next time step. If the current time step matches up with the overapproximation frequency, the algorithm calls the overapproximate function from LazySets.jl to compute an ε-close overapproximation of the reachable set for use at the next time step. Section 8.4.2 describes how the overapproximation function works.

Figure 8.14. An overapproximation of R1:4 for the mass-spring-damper system (panels: R1 = S, R2, R3, R4; axes p (m) and v (m/s)). We reduce the number of vertices in R3 by overapproximating it with the purple polytope. This overapproximation results in fewer vertices for R4 but causes it to produce a conservative estimate of the reachable set.

between two sets P and P̃ is the maximum distance from a point in P to the nearest point in P̃. A lower value for ε results in a less conservative overapproximation but may require more computation and result in a more complex representation. The rest of this section discusses a technique for computing this overapproximation.

8.4.1 Support Functions

We can overapproximate convex sets by sampling their support function. The support function ρ of a set P ⊂ Rⁿ is defined as

    ρ(d) = max_{p ∈ P} d⊤p    (8.22)


Suppose we want to determine if the mass-spring-damper system could reach the avoid set within 40 time steps. To reduce computational cost, we use algorithm 8.5 with an overapproximation frequency of 5 time steps. The plots below show the reachable set R1:40 using three different overapproximation tolerances ε (panels: ε = 0, ε = 0.001, ε = 1; axes p (m) and v (m/s)). The plot on the right shows the number of vertices in Rd for each depth d.
The first tolerance of ε = 0 results in no overapproximation, but the number of vertices grows quickly. The highest tolerance of ε = 1 results in significantly fewer vertices, but it is too conservative to the point where the reachable set overlaps with the avoid set. Therefore, the results of the analysis with ε = 1 are inconclusive. The middle tolerance of ε = 0.001 strikes a balance between the two extremes. With this tolerance, we are still able to verify safety while reducing the computational cost.

Example 8.4. The effect of overapproximation on accuracy and computational cost for the mass-spring-damper system. The plots show the reachable sets R1:40 using three different overapproximation tolerances. The additional plot shows the number of vertices in the reachable sets at each depth. If the tolerance is too high, the reachable set may overlap with the avoid set, and the analysis is inconclusive.


Figure 8.15. Overapproximating a polytope by evaluating its support function in various directions (panels: Bounding Halfspace, Bounding Polyhedron, Bounding Polytope). When we only use one direction (left), the overapproximation is a half space d⊤p ≤ ρ(d). By using multiple directions (center), we can construct a polyhedral overapproximation. However, the two vectors only positively span the shaded cone, resulting in an unbounded overapproximation. By adding a third direction (right), we can construct a set of directions that positively span R² and produce a bounded overapproximate set.

where d is a direction vector. The maximizer of the support function is called the support vector:

    σ(d) = arg max_{p ∈ P} d⊤p    (8.23)

For polytopes, there will always be a support vector in a given direction d that corresponds to one of its vertices (figure 8.16). In fact, evaluating the support function of a V-polytope at a direction d involves computing d⊤v for each vertex v ∈ V and taking the maximum. Evaluating the support function of an H-polytope requires solving a linear program. The support function of a zonotope can be computed in closed form as a function of its generators.¹⁴

Figure 8.16. Support vector of a polytope in a given direction d. The support vector is the vertex of the polytope that maximizes the support function.

¹⁴ M. Althoff and G. Frehse, "Combining Zonotopes and Support Functions for Efficient Reachability Analysis of Linear Systems," in IEEE Conference on Decision and Control (CDC), 2016.

The support function of a set P can be used to define a half space that contains the set:

    {p | d⊤p ≤ ρ(d)}    (8.24)

By evaluating the support function on a set of directions D = d1:m and taking the intersection of the resulting half spaces, we obtain a polyhedral overapproximation of the set P:

    P̃ = ⋂_{d ∈ D} {p | d⊤p ≤ ρ(d)}    (8.25)

For the overapproximation to be a polytope, the set D must be a positive spanning set. The set of directions D represents a positive spanning set if we can construct any point in Rⁿ as a nonnegative linear combination of the directions in D.¹⁵ Figure 8.15 demonstrates this concept in R².

¹⁵ R. G. Regis, "On the Properties of Positive Spanning Sets and Positive Bases," Optimization and Engineering, vol. 17, no. 1, pp. 229–262, 2016.
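To ground these definitions, the sketch below (a minimal example with an assumed triangle) evaluates the support function ρ and support vector σ with LazySets.jl and assembles an overapproximation from axis-aligned directions:

using LazySets

P = VPolytope([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])

d = [1.0, 0.0]
ρ(d, P)   # support function value: maximum of d⊤v over the vertices
σ(d, P)   # support vector: the maximizing vertex

# overapproximate P with half spaces from a positive spanning set
𝒟 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
P̃ = HPolytope([HalfSpace(d, ρ(d, P)) for d in 𝒟])
P ⊆ P̃   # true: the intersection of the half spaces contains P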


The choice of directions in D affects the tightness of the overapproximation. In general, adding more direction vectors to D will decrease overapproximation error. As we approach all possible direction vectors, the overapproximation converges to the set itself. However, more direction vectors will require more computation to create the overapproximated set and will result in a more complex overapproximate representation.
We want to select the directions in D to balance between overapproximation error and computational cost. If we have no prior information about the shape of the set, a common choice is to add a direction in the positive and negative direction of each axis. This choice will result in a hyperrectangular overapproximation. However, other choices of directions may result in a tighter overapproximation (figure 8.17). Section 8.4.2 discusses an iterative algorithm for intelligently selecting these directions.

Figure 8.17. Two different overapproximations of the blue polytope that each use four evaluations of the support function. The first overapproximation evaluates the support function in the positive and negative directions of the axes. The second overapproximation uses the directions of the diagonals of the unit square. The choice of directions affects the tightness of the overapproximation.

8.4.2 Iterative Refinement

One way to select the directions in D is to use a process called iterative refinement.¹⁶ The algorithm proceeds as follows:

1. Begin with a positive spanning set of template directions D and compute the corresponding overapproximate polytope by evaluating the support function in each direction. A common choice is the positive and negative directions of the axes.

2. Compute an inner approximation by taking the convex hull of the corresponding support vectors.

3. Compute the distance between each facet of the inner approximation and the nearest vertex of the outer approximation.

4. Add the direction of the facet that is furthest from the nearest vertex to D and return to step 1.

The process is repeated until the maximum distance between the inner and outer approximations is less than a specified tolerance ε. Figure 8.18 shows the steps involved in a single iteration of the algorithm, and figure 8.19 demonstrates the process over multiple iterations; a short usage sketch follows.

¹⁶ This method is implemented in the LazySets.jl package as the overapproximate function. For more details, see G. K. Kamenev, "An Algorithm for Approximating Polyhedra," Computational Mathematics and Mathematical Physics, vol. 4, no. 36, pp. 533–544, 1996.
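As a usage sketch (assuming the LazySets.jl interface mentioned in the sidenote, where ε bounds the Hausdorff distance of the result for two-dimensional sets):

using LazySets

Z = Zonotope([0.0, 0.0], [1.0 0.5 0.2; 0.0 1.0 -0.4])

# ε-close polygonal overapproximation via iterative refinement
P̃_coarse = overapproximate(Z, 1.0)    # looser tolerance, fewer directions
P̃_tight  = overapproximate(Z, 0.01)   # tighter tolerance, more directions

length(constraints_list(P̃_tight)) > length(constraints_list(P̃_coarse))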


Figure 8.18. Illustration of the steps involved in a single iteration of the iterative refinement algorithm (panels: Step 1 through Step 4). In this example, the initial overapproximation is a hyperrectangle. The distance between the inner and outer approximations is computed for each facet of the inner approximation. The direction of the facet that is furthest from the nearest vertex (purple arrow) is added to the template directions.

Figure 8.19. The resulting overapproximated polytope for various iterations of the iterative refinement algorithm (panels: Iteration 1, Iteration 2, Iteration 3, Converged). The Hausdorff distance between the overapproximated set and the true set decreases with each iteration until it is within the specified tolerance ε = 0.7.

8.5 Linear Programming

Another technique for computing overapproximate reachable sets of linear systems is to directly evaluate the support function of the reachable set at a desired depth d:

    ρ_d(d) = max_{s ∈ Rd} d⊤s    (8.26)

Similar to the support function of a polytope, the support function of a reachable set can be used to construct an overapproximation of the reachable set. We form the overapproximation by evaluating the support function in a set of directions D and taking the intersection of the resulting half spaces.
We can solve the optimization problem in equation (8.26) using a linear program solver. A linear program is an optimization problem where the objective function


and constraints are all linear. The linear program for equation (8.26) is

    maximize over s1:d, x1:d:   d⊤sd
    subject to:   s1 ∈ S
                  xt ∈ Xt for all t ∈ 1:d
                  st+1 = Step(st, xt) for all t ∈ 1:d−1    (8.27)

where

    Step(s, x) = (Ts + Ta Πo Os)s + Ta Πo xo + Ta xa + xs    (8.28)

The decision variables in equation (8.27) are the state and disturbances at each time step. The constraints enforce that the state and disturbances are within their respective sets and that the state evolves according to equation (8.9). The optimization problem in equation (8.27) can be solved efficiently using a variety of algorithms.¹⁷
For the optimization problem in equation (8.27) to be a linear program, the sets S and Xt must be polytopes. We can write them as a set of linear inequalities using their H-polytope representations. Algorithm 8.6 implements the linear program for computing the support function of a reachable set at a particular depth d. Given a desired time horizon h and a set of directions D, we can compute an overapproximation of R1:h by evaluating the support function at each direction for each depth. Algorithm 8.7 implements this process.

¹⁷ Modern linear programming solvers can solve problems with thousands of variables and constraints. H. Karloff, Linear Programming. Springer, 2008.
for each depth. Algorithm 8.7 implements this process.
Similar to the polytope overapproximation in section 8.4, the choice of the directions in D affects the tightness of the reachable set overapproximation. We could select the directions to align with the axes or use more sophisticated methods like the iterative refinement algorithm in section 8.4.2. Since linear program solvers are computationally efficient, another option is to simply evaluate the support function at many randomly sampled directions. We could also select the directions using trajectory samples. Given a set of samples from the reachable set, we can use principal component analysis (PCA)¹⁸ to determine the directions that best capture the shape of the set.¹⁹
The overapproximate reachable sets improve our understanding of the behavior of the system. However, if our ultimate goal is to check intersection with a convex avoid set U, we can solve the problem exactly without the need for

¹⁸ H. Abdi and L. J. Williams, "Principal Component Analysis," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
¹⁹ O. Stursberg and B. H. Krogh, "Efficient Representation and Computation of Reachable Sets for Hybrid Systems," in Hybrid Systems: Computation and Control, 2003.


Ab(𝒫) = tosimplehrep(constraints_list(𝒫))

function constrained_model(sys, d, 𝒮, 𝒳)
    model = Model(SCS.Optimizer)
    @variable(model, 𝐬[1:dim(𝒮), 1:d])
    @variable(model, 𝐱o[1:dim(𝒳.xo), 1:d])
    @variable(model, 𝐱s[1:dim(𝒳.xs), 1:d])
    @variable(model, 𝐱a[1:dim(𝒳.xa), 1:d])
    As, bs = Ab(𝒮)
    (Axo, bxo), (Axs, bxs), (Axa, bxa) = Ab(𝒳.xo), Ab(𝒳.xs), Ab(𝒳.xa)
    @constraint(model, As * 𝐬[:, 1] .≤ bs)
    for i in 1:d
        @constraint(model, Axo * 𝐱o[:, i] .≤ bxo)
        @constraint(model, Axs * 𝐱s[:, i] .≤ bxs)
        @constraint(model, Axa * 𝐱a[:, i] .≤ bxa)
    end
    Ts, Ta, Πo, Os = get_matrices(sys)
    for i in 1:d-1
        @constraint(model, (Ts + Ta*Πo*Os) * 𝐬[:, i] + Ta*Πo * 𝐱o[:, i]
                           + Ta * 𝐱a[:, i] + 𝐱s[:, i] .== 𝐬[:, i+1])
    end
    return model
end

function ρ(model, 𝐝, d)
    𝐬 = model.obj_dict[:𝐬]
    @objective(model, Max, 𝐝' * 𝐬[:, d])
    optimize!(model)
    return objective_value(model)
end

Algorithm 8.6. Computing the support function of a reachable set at a desired depth d. The constrained_model function constructs an optimization model with the constraints in equation (8.27) that is compatible with the JuMP.jl package. It uses the Ab function to convert a polytope to a set of linear inequalities. Given this model and a direction vector 𝐝, the ρ function solves the linear program and returns the value of the support function.


struct LinearProgramming <: ReachabilityAlgorithm
    h   # time horizon
    𝒟   # set of directions to evaluate support function
    tol # tolerance for checking satisfaction
end

function reachable(alg::LinearProgramming, sys)
    h, 𝒟 = alg.h, alg.𝒟
    𝒮, 𝒳 = 𝒮₁(sys.env), disturbance_set(sys)
    ℛ = 𝒮
    for d in 2:h
        model = constrained_model(sys, d, 𝒮, 𝒳)
        ρs = [ρ(model, 𝐝, d) for 𝐝 in 𝒟]
        ℛ = ℛ ∪ HPolytope([HalfSpace(𝐝, ρ) for (𝐝, ρ) in zip(𝒟, ρs)])
    end
    return ℛ
end

Algorithm 8.7. Linear forward reachability using linear programming. The system-specific 𝒮₁ and disturbance_set functions return the initial state set and disturbance set, respectively. For each depth, the algorithm creates the constrained model and evaluates the support function at each direction in 𝒟. It then constructs a polytope from the results and takes its union with the current reachable set. The algorithm returns R1:h. The tol input is a tolerance used by algorithm 8.8.

overapproximation. Specifically, we solve the following optimization problem:

    minimize over s1:d, x1:d, u:   ‖sd − u‖
    subject to:   u ∈ U
                  s1 ∈ S
                  xt ∈ Xt for all t ∈ {1, . . . , d}
                  st+1 = Step(st, xt) for all t ∈ {1, . . . , d − 1}    (8.29)

The solution to the optimization problem in equation (8.29) is the minimum distance between any point in the reachable set and the avoid set. If this distance is greater than zero, we can conclude that the reachable set does not intersect the avoid set at depth d. If the distance is equal to zero, we can conclude that the reachable set intersects the avoid set.
The norm in the objective function of equation (8.29) means that it is no longer a linear program. It is, however, a convex program as long as the avoid set is convex. If the avoid set is a union of convex sets, we can check intersection with each component separately (see example 8.5). Convex programs can be solved efficiently using a variety of algorithms.²⁰ Algorithm 8.8 implements this check for a given time horizon. For each depth, the algorithm solves the optimization problem in equation (8.29) and checks if the objective value is within some tolerance of zero. If the objective value is zero at any depth, the system does not satisfy the specification.

²⁰ S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.


Figure 8.20. Overapproximate reachable sets for the mass-spring-damper system using linear programming (columns: R2 through R5; rows: Axis Aligned, 10 Random, 50 Random, PCA). On each plot, the x-axis is position, and the y-axis is velocity. Each row uses a different strategy for selecting D. The first row uses directions aligned with the axes, the second row uses 10 randomly sampled directions, the third row uses 50 randomly sampled directions, and the fourth row uses directions selected based on the principal components of the trajectory samples. When we randomly sample only 10 directions, the overapproximation is too conservative to verify safety.

8.6 Summary

• While the sampling-based methods in the previous chapters draw conclusions based on a finite sampling of trajectories, formal methods such as reachability analysis consider the entire set of possible trajectories.

• We can compute reachable sets for linear systems by propagating sets through the system dynamics.

• We can efficiently propagate convex sets such as polytopes, zonotopes, and hyperrectangles through linear equations.


The avoid set for the mass-spring-damper system can be written as the union of two convex sets. Specifically, we require that |p| < 0.3. The first set is therefore represented by the linear inequality [1, 0]⊤s ≤ −0.3, and the second set is represented by the linear inequality [−1, 0]⊤s ≤ −0.3. To check whether the system could reach the avoid set, we run algorithm 8.8 for each component of the avoid set. The system does not satisfy the specification if the algorithm returns false for either component.

Example 8.5. Checking whether the mass-spring-damper system can reach the avoid set using convex programming.

function satisfies(alg::LinearProgramming, sys, ψ)
    𝒮, 𝒳 = 𝒮₁(sys.env), disturbance_set(sys)
    for d in 1:alg.h
        model = constrained_model(sys, d, 𝒮, 𝒳)
        @variable(model, u[1:dim(𝒮)])
        Au, bu = Ab(¬ψ)
        @constraint(model, Au * u .≤ bu)
        𝐬 = model.obj_dict[:𝐬]
        @objective(model, Min, sum((𝐬[i, d] - u[i])^2 for i in 1:dim(𝒮)))
        optimize!(model)
        if isapprox(objective_value(model), 0.0, atol=alg.tol)
            return false
        end
    end
    return true
end

Algorithm 8.8. Checking whether a system could reach a convex avoid set using convex programming. For each depth, the algorithm constructs a constrained model that considers the initial state, disturbances, and system dynamics. It then adds a variable for the avoid set and minimizes the squared distance between the reachable set and the avoid set (equivalent to minimizing the norm). If the distance is zero (within the numerical tolerance) at any depth, the system does not satisfy the specification.


• If the number of vertices in the reachable set grows too large, we can produce overapproximate representations by evaluating the support function on a set of directions.

• We can overapproximate the reachable set directly by solving a linear program to evaluate the support function.

9 Reachability for Nonlinear Systems

This chapter extends the set propagation and optimization techniques discussed in chapter 8 to perform reachability on nonlinear systems. A system is nonlinear if its agent, environment, or sensor model contains nonlinear functions. The reachable sets of nonlinear systems are often nonconvex and difficult to compute exactly. This chapter begins by discussing several set propagation techniques for nonlinear systems that overapproximate the reachable set.¹ We then discuss optimization-based nonlinear reachability methods. To minimize the overapproximation error introduced by these methods, we introduce a technique for overapproximation error reduction that involves partitioning the state space. We conclude by discussing reachability techniques for nonlinear systems represented by a neural network.

¹ For more details on set propagation through nonlinear systems, refer to M. Althoff, G. Frehse, and A. Girard, "Set Propagation Techniques for Reachability Analysis," Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 369–395, 2021.

9.1 Interval Arithmetic

For nonlinear systems, the reachability function r(s, x1:d) is a nonlinear function. In contrast with the linear systems in chapter 8, we cannot directly propagate arbitrary polytopes through nonlinear systems. We can, however, propagate hyperrectangular sets² using a technique called interval arithmetic.³ Interval arithmetic extends traditional arithmetic operations and other elementary functions to intervals. An interval is a set of real numbers written as

    [x] = [x̲, x̄] = {x | x̲ ≤ x ≤ x̄}    (9.1)

where x̲ and x̄ are the lower and upper bounds of the interval, respectively. A hyperrectangle, also known as an interval box, is the Cartesian product of a set of n intervals:

    [x] = [x1] × [x2] × · · · × [xn]    (9.2)

where [xi] = [x̲i, x̄i] for i ∈ 1:n (figure 9.1).

² We can also propagate sets that are linear transformations of hyperrectangles by reversing the linear transformation to obtain an axis-aligned hyperrectangle and performing the analysis in the transformed space.
³ L. Jaulin, M. Kieffer, O. Didrit, and É. Walter, Interval Analysis. Springer, 2001.


Given two intervals [x] and [y], we define the interval counterpart of an elementary arithmetic operation ◦ as

    [x] ◦ [y] = {x ◦ y | x ∈ [x], y ∈ [y]}    (9.3)

where ◦ represents the addition, subtraction, multiplication, or division operation. We evaluate the interval counterparts of these operations as follows:

    [x] + [y] = [x̲ + y̲, x̄ + ȳ]    (9.4)
    [x] − [y] = [x̲ − ȳ, x̄ − y̲]    (9.5)
    [x] × [y] = [min(x̲y̲, x̲ȳ, x̄y̲, x̄ȳ), max(x̲y̲, x̲ȳ, x̄y̲, x̄ȳ)]    (9.6)
    [x] / [y] = [min(x̲/y̲, x̲/ȳ, x̄/y̲, x̄/ȳ), max(x̲/y̲, x̲/ȳ, x̄/y̲, x̄/ȳ)]    (9.7)

where the division operation is only defined when 0 ∉ [y].

Figure 9.1. The Cartesian product of two intervals forms a hyperrectangle in R².
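A minimal sketch of these operations with IntervalArithmetic.jl (the interval constructor name follows recent versions of the package):

using IntervalArithmetic

x = interval(1.0, 2.0)
y = interval(-1.0, 1.0)

x + y   # [0, 3]
x - y   # [0, 3]
x * y   # [-2, 2]: min and max over the endpoint products
# x / y is undefined here since 0 ∈ y; dividing by an interval away from zero is fine
x / interval(0.5, 1.0)   # [1, 4]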


In general, we define the interval counterpart of a given function f ( x ) as

f ([ x ]) = [{ f ( x ) | x 2 [ x ]}] (9.8)

where the [·] operation takes the interval hull of the resulting set. The interval
hull of a set is the smallest interval that contains the set. Therefore, the interval
counterpart of a function returns the smallest interval that contains all possible
function evaluations of the points in the input interval.
We can define an interval counterpart for a variety of elementary functions.4 4
IntervalArithmetic.jl defines

For monotonically increasing functions such as exp, log, and square root, the the interval counterpart of many
elementary functions such as sin,
interval counterpart is cos, exp, and log in Julia.
f ([ x ]) = [ f ( x ), f ( x )] (9.9)
The interval counterpart for monotonically decreasing functions is similarly de-
fined. Nonmonotonic elementary functions such as sin, cos, and square require
multiple cases to define their interval counterparts. For example, the interval
counterpart for the square function is
8
<[min( x2 , x2 ), max( x2 , x2 )] if 0 2
/ [x]
[ x ]2 = (9.10)
:[0, max( x2 , x2 )] otherwise

Figure 9.2 shows example evaluations of the interval counterparts for the exp,
square, and sin functions.
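For instance, a quick numerical check of equation (9.10) with IntervalArithmetic.jl:

using IntervalArithmetic

interval(-1.0, 2.0)^2   # [0, 4]: zero is in the input, so the lower bound is 0
interval(1.0, 2.0)^2    # [1, 4]: monotonic case away from zero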


Figure 9.2. Example of the interval counterparts for the exp, square, and sin functions (panels: f(x) = exp(x), f(x) = x², f(x) = sin(x)).

9.2 Inclusion Functions

For complex functions, it is not always possible to define a tight interval counterpart. In these cases, we instead define an inclusion function. An inclusion function [f]([x]) outputs an interval that is guaranteed to contain the interval from the interval counterpart:

    f([x]) ⊆ [f]([x])    (9.11)

In other words, inclusion functions output overapproximate intervals. We can also define an inclusion function for multivariate functions that map from Rᵏ to R where k ≥ 1.
For reachability analysis, our goal is to propagate intervals through the function r(s, x1:d), which maps its inputs to Rⁿ where n is the dimension of the state space. We can rewrite r(s, x1:d) as a vector of functions that map to R as follows:

    s′ = r(s, x1:d) = [r1(s, x1:d), . . . , rn(s, x1:d)]⊤    (9.12)

where ri(s, x1:d) outputs the value of the ith component of s′. We can then define the inclusion function for each ri(s, x1:d) as [ri]([s], [x1:d]). By evaluating each inclusion function for the input intervals [s] and [x1:d], we obtain an overapproximate hyperrectangular reachable set. The rest of this section discusses techniques to create these inclusion functions.


9.2.1 Natural Inclusion Functions

One simple way to create an inclusion function from a complex function is to replace each elementary function with its interval counterpart. This type of inclusion function is known as a natural inclusion function. For example, the natural inclusion function for f(x) = x − sin(x) is [f]([x]) = [x] − sin([x]) (figure 9.3).
By replacing the elementary nonlinear components of the agent, environment, and sensor models with their interval counterparts, we can create the natural inclusion function for ri(s, x1:d). We can then use interval arithmetic to propagate hyperrectangular sets through the natural inclusion function. This computation will result in overapproximate reachable sets for nonlinear systems. Algorithm 9.1 implements the natural inclusion reachability algorithm and computes overapproximate reachable sets up to a desired time horizon. Example 9.1 applies algorithm 9.1 to the inverted pendulum problem.
As shown in figure 9.3 and example 9.1, natural inclusion functions tend to be overly conservative. This property is due to the dependency effect, in which multiple occurrences of the same variable are treated independently (see example 8.2). In chapter 8, we were able to eliminate this effect by simplifying equations to algebraically combine all repeated instances of a variable. However, this simplification is not always possible for nonlinear functions such as the one shown in figure 9.3. We can instead mitigate the dependency effect by using more sophisticated techniques for generating inclusion functions, which we discuss in the remainder of this section.

Figure 9.3. Example evaluation of the natural inclusion function for f(x) = x − sin(x). The inclusion function produces an overapproximate interval.
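The dependency effect is easy to observe numerically. In the sketch below (assuming the running example f(x) = x − sin(x) as reconstructed above), the natural inclusion treats the two occurrences of x independently, so the result is much wider than the true range:

using IntervalArithmetic

f(x) = x - sin(x)

X = interval(-2.0, 2.0)
f(X)   # natural inclusion: [-2, 2] − [-1, 1] = [-3, 3]

# the true range, estimated by dense sampling, is only about [-1.09, 1.09]
extrema(f.(range(-2.0, 2.0, length=1001)))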

9.2.2 Mean Value Inclusion Functions

For functions that are continuous and differentiable, we can use the mean value theorem to create an inclusion function. The mean value theorem states that for a function f(x) that is continuous and differentiable on the interval [x], there exists a point x′ ∈ [x] such that

    (f(x̄) − f(x̲)) / (x̄ − x̲) = f′(x′)    (9.13)

In other words, there exists a point in [x] where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the interval (figure 9.4).

Figure 9.4. Illustration of the mean value theorem on the function f(x) = x² over the interval [x] = [1, 4].


struct NaturalInclusion <: ReachabilityAlgorithm
    h # time horizon
end

function r(sys, x)
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    return τ[end].s
end

to_hyperrectangle(𝐈) = Hyperrectangle(low=[i.lo for i in 𝐈],
                                      high=[i.hi for i in 𝐈])

function reachable(alg::NaturalInclusion, sys)
    𝐈′s = []
    for d in 1:alg.h
        𝐈 = intervals(sys, d)
        push!(𝐈′s, r(sys, 𝐈))
    end
    return UnionSetArray([to_hyperrectangle(𝐈′) for 𝐈′ in 𝐈′s])
end

Algorithm 9.1. Nonlinear forward reachability using natural inclusion functions. For each depth, the algorithm gets the input intervals using the system-specific intervals function and computes the output intervals using the natural inclusion function of r(s, x1:d). The IntervalArithmetic.jl package replaces functions with their interval counterparts so that we can propagate the intervals directly through the rollout function. The algorithm returns R1:h as the union of the output intervals.

Suppose we want to compute reachable sets for the pendulum system with bounded sensor noise on the angle and angular velocity using algorithm 9.1. We define the intervals and extract functions as follows:

function intervals(sys, d)
    disturbance_mag = 0.01
    θmin, θmax = -π/16, π/16
    ωmin, ωmax = -1.0, 1.0
    𝐈 = [interval(θmin, θmax), interval(ωmin, ωmax)]
    for i in 1:2d
        push!(𝐈, interval(-disturbance_mag, disturbance_mag))
    end
    return 𝐈
end

function extract(env::InvertedPendulum, x)
    s = x[1:2]
    𝐱 = [Disturbance(0, 0, x[i:i+1]) for i in 3:2:length(x)]
    return s, 𝐱
end

The intervals function returns the initial state intervals followed by the disturbance intervals for each time step. The extract function extracts these intervals into the state and disturbance components. The plot in the caption shows the overapproximated reachable set after two time steps.

Example 9.1. Computing reachable sets for the inverted pendulum system using its natural inclusion function. The plot (axes θ (rad) and ω (rad/s)) shows the overapproximated reachable set R2 computed using algorithm 9.1 compared to a set of samples from R2.


The mean value theorem implies that for any subinterval of [x], there exists a point in [x] where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the subinterval. Therefore, given the center c of the interval [x], there exists a point x′ ∈ [x] such that

    (f(x) − f(c)) / (x − c) = f′(x′)    (9.14)

for any x ∈ [x] (figure 9.5). Rearranging equation (9.14) gives

    f(x) = f(c) + f′(x′)(x − c)    (9.15)

Because we know that x′ ∈ [x], we can use equation (9.15) to create an inclusion function for f(x) as follows:

    [f]([x]) = f(c) + [f′]([x])([x] − c)    (9.16)

where [f′]([x]) is an inclusion function for f′(x). It is common to define [f′]([x]) as the natural inclusion function for f′(x). For multivariate functions, equation (9.16) generalizes to

    [f]([x]) = f(c) + [∇f]([x])⊤([x] − c)    (9.17)

where c is the center of the interval [x] and [∇f]([x]) is an inclusion function for the gradient of f(x).

Figure 9.5. For a given subinterval [c, x], there exists a point in [x] where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the subinterval.

Equation (9.17) is a linearization of the nonlinear function f(x). Therefore, mean value inclusion functions tend to perform well when the input interval covers a region of the input space for which the function is nearly linear. Figure 9.6 shows an evaluation of the mean value inclusion function for the function in figure 9.3. Because the function is roughly linear over the input interval, the mean value inclusion function provides a tighter overapproximation. However, if we expand the input interval to include nonlinear regions, the mean value inclusion function produces more conservative results (figure 9.7).

Figure 9.6. Mean value inclusion function for f(x) = x − sin(x) over the same interval as figure 9.3.

Figure 9.7. Mean value inclusion function for f(x) = x − sin(x) over a wider interval.

9.2.3 Taylor Inclusion Functions

Natural inclusion functions and mean value inclusion functions are special cases of a more general type of inclusion function known as a Taylor inclusion function. These inclusion functions use Taylor series expansions about the center of the

input interval. In one dimension, a Taylor inclusion function of order n for a function f(x) is defined as

    [f]([x]) = f(c) + f′(c)([x] − c) + (f″(c)/2!)([x] − c)² + · · · + ([f⁽ⁿ⁾]([x])/n!)([x] − c)ⁿ    (9.18)

where c is the center of the interval [x] and [f⁽ⁿ⁾]([x]) is an inclusion function for the nth-order derivative of f(x).⁵
Taylor inclusion functions can be similarly defined for multivariate functions. The second-order Taylor inclusion function for a multivariate function f(x) is

    [f]([x]) = f(c) + ∇f(c)⊤([x] − c) + ½([x] − c)⊤[∇²f]([x])([x] − c)    (9.19)

where c is the center of the interval [x] and [∇²f]([x]) is an inclusion function for the Hessian of f(x).⁶ A zero-order Taylor inclusion function is equivalent to the natural inclusion function, and a first-order Taylor inclusion function is equivalent to the mean value inclusion function.

Figure 9.8. Evaluation of Taylor inclusion functions of different orders for f(x) = x − sin(x) over the interval [x] = [−1, 1] (top row) and [x] = [−1.5, 1.5] (bottom row); panels: Natural Inclusion, First Order, Second Order, Third Order. The natural inclusion function is equivalent to the zeroth-order Taylor inclusion function. As the order increases, overapproximation error decreases, especially over the wider input interval.

⁵ It is possible to create a Taylor inclusion function centered around any point in the interval. However, choosing the center of the interval minimizes overapproximation error.
⁶ For higher-order models, see R. Neidinger, "Directions for Computing Truncated Multivariate Taylor Series," Mathematics of Computation, vol. 74, no. 249, pp. 321–340, 2005.

In general, higher-order Taylor inclusion functions provide tighter overapproximations (figure 9.8). However, the benefit of using higher-order terms depends on the behavior of the function over the input interval. If the function is nearly linear over the input interval, moving beyond a first-order model may not be


worth the additional computational cost. In contrast, if the function is highly nonlinear over the input interval, a higher-order model may significantly decrease overapproximation error.
Algorithm 9.2 implements first- and second-order Taylor inclusion functions for reachability analysis. The algorithm computes overapproximate reachable sets up to a desired time horizon by evaluating the Taylor inclusion function for each subfunction ri(s, x1:d). Taylor inclusion functions can be used to create tighter overapproximations of the reachable set than natural inclusion functions, especially for short time horizons (figure 9.9).⁷ However, the nonlinearities compound for each time step, so Taylor models can be computationally expensive and result in significant overapproximation error for long time horizons (example 9.2).

⁷ Because Taylor inclusion functions can only be applied to functions that are continuous and differentiable, we use a modified version of the pendulum problem in this chapter that does not apply clamping in the environment model.
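A minimal scalar sketch of equations (9.16) and (9.18), using nested ForwardDiff.jl derivatives over intervals (the interval compatibility of ForwardDiff.jl is noted in algorithm 9.2's caption; the helper names here are hypothetical and f is the assumed running example):

using IntervalArithmetic, ForwardDiff

f(x) = x - sin(x)
df(x) = ForwardDiff.derivative(f, x)
d2f(x) = ForwardDiff.derivative(df, x)

# first- and second-order Taylor inclusion functions for a scalar function
function taylor_inclusion(f, X; order=2)
    c = mid(X)
    order == 1 && return f(c) + df(X) * (X - c)          # eq. (9.16)
    return f(c) + df(c) * (X - c) + d2f(X)/2 * (X - c)^2 # eq. (9.18), n = 2
end

X = interval(-1.5, 1.5)
taylor_inclusion(f, X; order=1)  # mean value inclusion
taylor_inclusion(f, X; order=2)  # tighter on this wider interval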

struct TaylorInclusion <: ReachabilityAlgorithm
    h     # time horizon
    order # order of Taylor inclusion function (supports 1 or 2)
end

function taylor_inclusion(sys, 𝐈, order)
    c = mid.(𝐈)
    fc = r(sys, c)
    if order == 1
        𝐈′ = [fc[i] + gradient(x->r(sys, x)[i], 𝐈)' * (𝐈 - c)
              for i in eachindex(fc)]
    else
        𝐈′ = [fc[i] + gradient(x->r(sys, x)[i], c)' * (𝐈 - c) +
              0.5 * (𝐈 - c)' * hessian(x->r(sys, x)[i], 𝐈) * (𝐈 - c)
              for i in eachindex(fc)]
    end
    return 𝐈′
end

function reachable(alg::TaylorInclusion, sys)
    𝐈′s = []
    for d in 1:alg.h
        𝐈 = intervals(sys, d)
        𝐈′ = taylor_inclusion(sys, 𝐈, alg.order)
        push!(𝐈′s, 𝐈′)
    end
    return UnionSetArray([to_hyperrectangle(𝐈′) for 𝐈′ in 𝐈′s])
end

Algorithm 9.2. Nonlinear forward reachability using first- or second-order Taylor inclusion functions. For each depth, the algorithm gets the input intervals using the system-specific intervals function and applies either equation (9.17) or equation (9.19) to each subfunction ri(s, x1:d) of r(s, x1:d). The IntervalArithmetic.jl and ForwardDiff.jl packages are compatible, which allows us to evaluate gradients and Hessians over intervals. The algorithm returns R1:h as the union of the output intervals.


Figure 9.9. Comparison of the one-step overapproximated reachable sets for the inverted pendulum system using natural, first-order Taylor, and second-order Taylor inclusion functions (panels: Natural Inclusion, First Order, Second Order; axes θ (rad) and ω (rad/s)). In this particular example, a first-order Taylor inclusion function provides a significantly tighter overapproximation than the natural inclusion function. The second-order Taylor inclusion function does not provide a significant benefit over the first-order Taylor inclusion function, indicating that the dynamics are roughly linear over the input space.

The plots below show the overapproximate reachable sets for the inverted pendulum system produced by a first-order Taylor inclusion function at different depths (panels: R2 through R5; axes θ (rad) and ω (rad/s)). As the depth increases, the overapproximation error increases. This result is due to the increasing presence of nonlinearities in the system dynamics as we increase the depth. For the one-step reachable set (R2), the only nonlinearity present is the sine function in the pendulum dynamics. As the depth increases, this nonlinearity will be repeated for each time step, leading to larger overapproximation error.

Example 9.2. Overapproximate reachable sets for the inverted pendulum system using first-order Taylor inclusion functions at different depths. As the depth increases, the overapproximation error increases.


9.3 Taylor Models

While inclusion functions only operate over interval inputs and output reachable sets in the form of hyperrectangles, Taylor models operate over other types of input sets and are able to represent more expressive reachable sets.⁸ Similar to Taylor inclusion functions, Taylor models are based on Taylor series expansions. An nth-order Taylor model is a set represented as

    T = {p(x) + α | x ∈ X, α ∈ [α]}    (9.20)

where X is the input set, p(x) is a polynomial of degree n − 1, and [α] is an interval remainder term. In one dimension, the polynomial of an nth-order Taylor model for the function f(x) over an input interval [x] is defined as

    p(x) = f(c) + f′(c)(x − c) + (f″(c)/2!)(x − c)² + · · · + (f⁽ⁿ⁻¹⁾(c)/(n−1)!)(x − c)ⁿ⁻¹    (9.21)

where c is the center of the input interval. The interval remainder term, also known as the Lagrange remainder, bounds the sum of the rest of the terms in the Taylor expansion over the input interval [x] so that the Taylor model is guaranteed to contain the true output of the function. It is calculated as

    [α] = ([f⁽ⁿ⁾]([x])/n!)([x] − c)ⁿ    (9.22)

and is equivalent to the last term in a Taylor inclusion function of order n. In fact, passing an interval through a Taylor model performs the same computation as a Taylor inclusion function of the same order.
As the order of a Taylor model increases, overapproximation error tends to decrease (figure 9.10). Producing a zero-order Taylor model is equivalent to evaluating the natural inclusion function, while producing a first-order Taylor model is equivalent to evaluating the mean value inclusion function. Taylor models begin to deviate from inclusion functions for orders of two or higher. Second-order Taylor models represent arbitrary polytopes, while second-order inclusion functions only produce hyperrectangles. Higher-order Taylor models correspond to nonconvex sets, which are more difficult to understand and manipulate.⁹ For this reason, we focus the remainder of this section on second-order Taylor models.
Creating a second-order Taylor model for a function f(x) is a process known as conservative linearization.¹⁰

⁸ K. Makino and M. Berz, "Taylor Models and Other Validated Functional Inclusion Methods," International Journal of Pure and Applied Mathematics, vol. 4, no. 4, pp. 379–456, 2003.
⁹ One way to handle this nonconvexity is to represent sets using an extension of zonotopes called polynomial zonotopes. More details can be found in M. Althoff, "Reachability Analysis of Nonlinear Systems Using Conservative Polynomialization and Non-Convex Sets," in International Conference on Hybrid Systems: Computation and Control, 2013. Another representation called star sets can also be used to represent nonconvex sets and has been used for reachability. H.-D. Tran, D. Manzanas Lopez, P. Musau, X. Yang, L. V. Nguyen, W. Xiang, and T. T. Johnson, "Star-Based Reachability Analysis of Deep Neural Networks," in International Symposium on Formal Methods, 2019.
¹⁰ M. Althoff, O. Stursberg, and M. Buss, "Reachability Analysis of Nonlinear Systems with Uncertain Parameters Using Conservative Linearization," in IEEE Conference on Decision and Control (CDC), 2008.
ence on Decision and Control (CDC),
conservative linearization.10 Given an input set X and a center point c, the second- 2008.


Figure 9.10. Taylor models of different orders (zero through third) for f(x) = x − sin(x) over the interval [x] = [−1.5, 0.0]. The dashed purple lines show results from a Taylor inclusion function of the same order.

Given an input set \mathcal{X} and a center point \mathbf{c}, the second-order Taylor model is

    \mathcal{T} = \{ f(\mathbf{c}) + \mathbf{J}(\mathbf{x} - \mathbf{c}) + \alpha \mid \mathbf{x} \in \mathcal{X},\ \alpha \in [\alpha] \}    (9.23)

where \mathbf{J} is the Jacobian of f evaluated at \mathbf{c} and [\alpha] is the interval remainder term. The Jacobian is a generalization of the gradient to functions with multidimensional outputs and is computed as

    \mathbf{J} = \begin{bmatrix} \nabla f_1(\mathbf{c})^\top \\ \vdots \\ \nabla f_n(\mathbf{c})^\top \end{bmatrix}    (9.24)

where \nabla f_i(\mathbf{c}) is the gradient of the ith component of f evaluated at \mathbf{c}. The interval remainder term is calculated using interval arithmetic as

    [\alpha] = \frac{1}{2}([\mathcal{X}] - \mathbf{c})^\top [\nabla^2 f]([\mathcal{X}])\, ([\mathcal{X}] - \mathbf{c})    (9.25)

where [\mathcal{X}] is the interval hull of \mathcal{X}.¹¹

    11. If the input set \mathcal{X} is represented as a zonotope, it is also possible to overapproximate the remainder term directly without taking the interval hull. This approach can reduce overapproximation error. M. Althoff, O. Stursberg, and M. Buss, "Reachability Analysis of Nonlinear Systems with Uncertain Parameters Using Conservative Linearization," in IEEE Conference on Decision and Control (CDC), 2008.

Equation (9.23) represents a linear approximation of the nonlinear function f with a remainder term that bounds the error of the approximation. Because all of the operations in equation (9.23) are linear, we can use it to propagate convex sets. In other words, if \mathcal{X} is convex, we can rewrite the Taylor model in terms of linear transformations and Minkowski sums as

    \mathcal{T} = f(\mathbf{c}) + \mathbf{J}(\mathcal{X} \oplus (-\mathbf{c})) \oplus [\alpha]    (9.26)


Algorithm 9.3 computes overapproximate reachable sets using conservative


linearization. Since Taylor models can be applied to functions with multidimen-
sional outputs, we can apply conservative linearization directly to the reachability
function r (s, x1:d ) without the need to break it into subfunctions. Example 9.3
demonstrates algorithm 9.3 on the inverted pendulum system. Conservative lin-
earization using Taylor models performs better than second-order Taylor inclusion
functions because it is able to output more expressive reachable sets. However,
for higher orders, Taylor models output nonconvex sets that are difficult to ma-
nipulate. In contrast, Taylor inclusion functions always output hyperrectangles
and do not suffer from this added complexity.

Algorithm 9.3. Nonlinear forward reachability using conservative linearization. At each depth, the algorithm gets the input sets for the initial states and disturbances using the system-specific sets function and applies equation (9.23) to r(s, x1:d). It uses the interval hull of the input set to calculate the interval remainder term. The algorithm returns R1:h as the union of the output sets.

struct ConservativeLinearization <: ReachabilityAlgorithm
    h # time horizon
end

to_intervals(𝒫) = [interval(lo, hi) for (lo, hi) in zip(low(𝒫), high(𝒫))]

function conservative_linearization(sys, 𝒫)
    𝐈 = to_intervals(interval_hull(𝒫))
    c = mid.(𝐈)
    fc = r(sys, c)
    J = ForwardDiff.jacobian(x -> r(sys, x), c)
    α = to_hyperrectangle([(𝐈 - c)' * hessian(x -> r(sys, x)[i], 𝐈) * (𝐈 - c)
                           for i in eachindex(fc)])
    return fc + J * (𝒫 ⊕ -c) ⊕ α
end

function reachable(alg::ConservativeLinearization, sys)
    ℛs = []
    for d in 1:alg.h
        𝒮, 𝒳 = sets(sys, d)
        𝒮′ = conservative_linearization(sys, 𝒮 × 𝒳)
        push!(ℛs, 𝒮′)
    end
    return UnionSetArray([ℛs...])
end

9.4 Concrete Reachability

Algorithms 9.2 and 9.3 tend to be computationally expensive when computing


reachable sets over long time horizons. As the depth d increases, the input di-
mension for the reachability function r (s, x1:d ) also increases. This increase in


Example 9.3. Computing the one-step reachable set for the inverted pendulum system using conservative linearization. Conservative linearization better approximates the reachable set than a second-order Taylor inclusion function.

Suppose we want to compute reachable sets for the pendulum system with bounded sensor noise on the angle and angular velocity using algorithm 9.3. We define the sets function as follows:

function sets(sys, d)
    disturbance_mag = 0.01
    θmin, θmax = -π/16, π/16
    ωmin, ωmax = -1.0, 1.0
    low = [θmin, ωmin]
    high = [θmax, ωmax]
    for i in 1:d
        append!(low, [-disturbance_mag, -disturbance_mag])
        append!(high, [disturbance_mag, disturbance_mag])
    end
    return Hyperrectangle(low=low, high=high)
end

The sets function returns the initial state set followed by the disturbance sets for each time step. The plots below compare the one-step reachable set produced by conservative linearization with the set produced by a second-order Taylor inclusion function. While conservative linearization still produces an overapproximation, it captures the shape of the true reachable set better than a Taylor inclusion function.

(Plots: Taylor inclusion vs. conservative linearization in the (θ, ω) plane.)


Figure 9.11. Comparison of symbolic and concrete reachability algorithms when computing R3 for the inverted pendulum system. The symbolic reachability algorithm directly computes R3 without explicitly computing R2 by considering r(s1, x1:3) as a single function. The concrete reachability algorithm computes R2 and R3 separately by considering r(s1, x1:2) and r(s2, x2:3) as separate functions. It uses R2 as the input set for computing R3.

input dimension causes the size of the gradient and Hessian to increase, leading
to more expensive computations. Furthermore, the nonlinearities in the agent,
environment, and sensor models compound over time, causing the accuracy of a
linearized model to degrade as the depth increases.
Concrete reachability algorithms address these issues by decomposing the reach-
ability function into a sequence of simpler functions. Instead of overapproximating
the reachable set over the entire depth at once, they compute the overapproximate
reachable set for each time step individually. At each iteration, they use the over-
approximate reachable set from the previous time step as the input set for the next
time step. We refer to this process as concrete reachability because we concretize
the reachable set at each time step by explicitly computing an overapproximate
representation. In contrast, the algorithms presented thus far maintain a sym-
bolic representation of the reachable set at each time step and only concretize the
reachable set at depth d. For this reason, we refer to these algorithms as symbolic
reachability algorithms. Figure 9.11 illustrates the difference between symbolic and
concrete reachability algorithms.
Algorithms 9.4 and 9.5 implement concrete versions of the symbolic reach-
ability algorithms presented in algorithms 9.2 and 9.3, respectively. For each
depth in the time horizon, they compute the overapproximate reachable set for
the next step using the overapproximate reachable set from the previous step.
Algorithm 9.4 concretizes the reachable set into a hyperrectangle at each time


Algorithm 9.4. Nonlinear forward reachability using Taylor inclusion functions, concretizing the reachable set at each time step. The algorithm first gets the intervals for a depth of 2, which correspond to the intervals for a one-step reachability computation. At each depth, it computes the intervals for the next time step and creates the input for the next time step by extracting the new state. The algorithm returns R1:h as the union of the output sets.

struct ConcreteTaylorInclusion <: ReachabilityAlgorithm
    h     # time horizon
    order # order of Taylor inclusion function (supports 1 or 2)
end

function reachable(alg::ConcreteTaylorInclusion, sys)
    𝐈 = intervals(sys, 2)
    s, _ = extract(sys.env, 𝐈)
    𝐈′s = [s]
    for d in 2:alg.h
        𝐈′ = taylor_inclusion(sys, 𝐈, alg.order)
        push!(𝐈′s, 𝐈′)
        s, _ = extract(sys.env, 𝐈′) # extract the newly computed state intervals
        𝐈[1:length(s)] = s
    end
    return UnionSetArray([to_hyperrectangle(𝐈′) for 𝐈′ in 𝐈′s])
end

Algorithm 9.5. Nonlinear forward reachability using conservative linearization, concretizing the reachable set at each time step. The algorithm first gets the state and disturbance sets for a depth of 2, which correspond to the sets required for a one-step reachability computation. At each depth, it computes the state set for the next time step and uses it to compute the next reachable set. The algorithm returns R1:h as the union of the output sets.

struct ConcreteConservativeLinearization <: ReachabilityAlgorithm
    h # time horizon
end

function reachable(alg::ConcreteConservativeLinearization, sys)
    𝒮, 𝒳 = sets(sys, 2)
    ℛs = []
    push!(ℛs, 𝒮)
    for d in 2:alg.h
        𝒮 = conservative_linearization(sys, 𝒮 × 𝒳)
        push!(ℛs, 𝒮)
    end
    return UnionSetArray([ℛs...])
end


step, while algorithm 9.5 concretizes the reachable set into a polytope at each
time step.
Concrete reachability algorithms are generally more computationally efficient
than symbolic reachability algorithms. However, it is not always clear whether
they will produce tighter overapproximations because there are multiple factors
that contribute to the overapproximation error. The only source of overapproxima-
tion error in symbolic reachability algorithms is the error introduced by linearizing
the reachability function and bounding the remainder term. We expect this lin-
earization error to be smaller for concrete reachability algorithms because they
linearize over a single time step rather than the entire time horizon.
While concrete reachability algorithms reduce overapproximation error due to
linearization, they introduce additional overapproximation error by concretizing
the reachable set at each time step into an overapproximate reachable set (fig-
ure 9.11). This error compounds over time, and the accumulation of this error is
often referred to as the wrapping effect.
The decrease in linearization error and introduction of the wrapping effect
for concrete reachability algorithms result in a tradeoff between concrete and
symbolic reachability (figures 9.12 and 9.13). The choice of which type of al-
gorithm to use depends on the specific system, the reachability algorithm, and
the desired tradeoff between computational efficiency and overapproximation
error. For example, if we are using linearized models for reachability and the
one-step reachability function is nearly linear, concrete reachability algorithms
may produce tighter overapproximations than symbolic reachability algorithms.
It is common to mix concrete and symbolic reachability algorithms to take advan-
tage of the strengths of each approach. For example, instead of concretizing the
reachable set at each time step, we can concretize the reachable set every k time
steps to reduce the wrapping effect.
Another benefit of using concrete reachability algorithms is that we can use them to check for invariant sets. Similar to the check for invariance described for the set propagation techniques in section 8.2, if we find that the reachable set at a given time step is contained within the concrete reachable set at the previous time step, we can conclude that the reachable set is invariant. For example, the concrete versions of R6 in figures 9.12 and 9.13 are contained within the concrete versions of R5. Therefore, we can conclude that R6 is an invariant set in both cases, meaning that the system will remain within the set for all future time steps.
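A minimal sketch of this check, assuming the concrete reachable sets are stored as a vector of LazySets objects as in algorithms 9.4 and 9.5:

using LazySets

# Returns the first depth d at which ℛs[d] ⊆ ℛs[d-1], implying that
# ℛs[d] is an invariant set; returns nothing if no such depth exists.
function first_invariant_depth(ℛs)
    for d in 2:length(ℛs)
        ℛs[d] ⊆ ℛs[d-1] && return d
    end
    return nothing
end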


Figure 9.12. Comparison of symbolic and concrete Taylor inclusion algorithms when computing R1:6 for the inverted pendulum system. Up to a depth of 5, the concretization error dominates, so the symbolic algorithm produces tighter overapproximations. However, at a depth of 6, the linearization error dominates, and the concrete algorithm produces a tighter overapproximation.

Figure 9.13. Comparison of symbolic and concrete conservative linearization algorithms when computing R1:6 for the inverted pendulum system. Because concretization using polytopes does not introduce as much error as the hyperrectangles used by Taylor inclusion functions, the linearization error dominates, and the concrete algorithm produces tighter overapproximations.

We cannot draw this conclusion using symbolic reachability algorithms because checking this property requires comparing concrete reachable sets at consecutive time steps.

9.5 Optimization-Based Nonlinear Reachability

Similar to the ideas in section 8.5, we can overapproximate the reachable set of
nonlinear systems by sampling the support function. For symbolic reachability,


we rewrite the optimization problem in equation (8.27) as

    \begin{aligned}
    \underset{\mathbf{s}_d,\, \mathbf{x}_{1:d}}{\text{minimize}} \quad & \mathbf{d}^\top \mathbf{s}_d \\
    \text{subject to} \quad & \mathbf{s}_1 \in \mathcal{S} \\
    & \mathbf{x}_t \in \mathcal{X}_t \ \text{for all} \ t \in 1:d \\
    & \mathbf{s}_d = r(\mathbf{s}_1, \mathbf{x}_{1:d})
    \end{aligned}    (9.27)

For concrete reachability, we replace the last constraint with a constraint for each time step as follows:

    \mathbf{s}_{t+1} = r(\mathbf{s}_t, \mathbf{x}_{t:t+1}) \ \text{for all} \ t \in 1:d-1    (9.28)

The optimization problem in equation (9.27) is a nonlinear program because


r (s1 , x1:d ) is a nonlinear function of the state and disturbance. However, to ensure
that the overapproximation of the reachable set holds, we must solve this opti-
mization problem exactly. In general, we cannot find exact solutions for nonlinear
programs, so we must introduce further overapproximations. The rest of this
section discusses these methods.

9.5.1 Linear Programming through Conservative Linearization

We can transform the nonlinear program in equation (9.27) into a linear program using the conservative linearization technique introduced in section 9.3. Specifically, we create the following linear program:

    \begin{aligned}
    \underset{\mathbf{s}_d,\, \mathbf{x}_{1:d},\, \alpha}{\text{minimize}} \quad & \mathbf{d}^\top \mathbf{s}_d \\
    \text{subject to} \quad & \mathbf{s}_1 \in \mathcal{S} \\
    & \mathbf{x}_t \in \mathcal{X}_t \ \text{for all} \ t \in 1:d \\
    & \mathbf{s}_d = r(\mathbf{s}_c, \mathbf{x}_c) + \mathbf{J} \begin{bmatrix} \mathbf{s}_1 - \mathbf{s}_c \\ \mathbf{x}_{1:d} - \mathbf{x}_c \end{bmatrix} + \alpha \\
    & \alpha \in [\alpha]
    \end{aligned}    (9.29)

where \mathbf{s}_c and \mathbf{x}_c are the centers of the state and disturbance sets and \mathbf{J} is the Jacobian of the reachability function evaluated at \mathbf{s}_c and \mathbf{x}_c. We introduce another decision variable \alpha to represent the remainder term and constrain it to belong to the Lagrange remainder interval [\alpha] (equation (9.25)).


Figure 9.14. Example of a piecewise linear function:

    f(x) = \begin{cases} 4 & x < 2 \\ 4x - 4 & 2 \le x < 3 \\ -3x + 17 & 3 \le x < 5 \\ 2x - 8 & 5 \le x < 6 \\ 4 & x \ge 6 \end{cases}

The concrete version of this linear program is similarly defined by replacing equation (9.28) with a conservative linearization for each time step.
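As an illustration, the following sketch solves the symbolic version of equation (9.29) with JuMP and the HiGHS solver, assuming box bounds on the combined state and disturbance vector. The inputs rc (the value r(s_c, x_c)), the Jacobian J, the center c, and the remainder bounds would come from the conservative linearization of a specific system; support_lp is our own name for this helper.

using JuMP, HiGHS

function support_lp(dir, rc, J, c, zlo, zhi, αlo, αhi)
    model = Model(HiGHS.Optimizer)
    set_silent(model)
    n, m = length(rc), length(zlo)
    @variable(model, zlo[i] <= z[i=1:m] <= zhi[i])  # z stacks s₁ and x₁:d
    @variable(model, αlo[i] <= α[i=1:n] <= αhi[i])  # Lagrange remainder
    @variable(model, sd[1:n])
    @constraint(model, sd .== rc .+ J * (z .- c) .+ α)
    @objective(model, Min, dir' * sd)
    optimize!(model)
    return objective_value(model)  # optimal value of equation (9.29) in direction dir
end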

9.5.2 Piecewise Linear Models

If the reachability function r(s, x1:d) is piecewise linear, we can formulate the optimization problem as a mixed-integer linear program (MILP). A piecewise linear function is a function that comprises multiple linear functions that are activated based on the region of the input space (figure 9.14). A mixed-integer linear program is a linear program that includes some decision variables that are constrained to a set of integers. We can convert piecewise linear functions to mixed-integer constraints by introducing binary variables that activate the appropriate linear function based on the input.

The process of encoding a piecewise linear function as a set of constraints begins by writing the function in terms of max and min functions (example 9.4). It is possible to write any piecewise linear function in this form.¹² We can then convert the max and min functions to mixed-integer constraints.¹³ Example 9.5 shows the conversion of the ReLU function. Encoding the piecewise linear reachability function as a set of mixed-integer constraints turns equation (9.27) into a MILP, which we can solve using a variety of algorithms.¹⁴

While many real-world nonlinear systems do not have piecewise linear reachability functions, we can overapproximate them with piecewise linear bounds. First, we decompose the reachability function into a conjunction of elementary nonlinear functions (see example 9.6).

    12. C. Sidrane, A. Maleki, A. Irfan, and M. J. Kochenderfer, "OVERT: An Algorithm for Safety Verification of Neural Network Control Policies for Nonlinear Systems," Journal of Machine Learning Research, vol. 23, no. 117, pp. 1–45, 2022.

    13. More details can be found in V. Tjeng, K. Y. Xiao, and R. Tedrake, "Evaluating Robustness of Neural Networks with Mixed Integer Programming," in International Conference on Learning Representations (ICLR), 2018.

    14. A detailed overview of integer programming can be found in L. A. Wolsey, Integer Programming. Wiley, 2020. Modern solvers, such as Gurobi and CPLEX, can routinely handle problems with millions of variables. There are packages for Julia that provide access to Gurobi, CPLEX, and a variety of other solvers.

Example 9.4. Writing the ReLU function in terms of the max function.

Consider the following piecewise linear function:

    f(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{otherwise} \end{cases}

This function is often referred to as the rectified linear unit (ReLU) function and is commonly used in neural networks. We can rewrite this function in terms of the max function as follows:

    f(x) = \max(0, x)

If x < 0, the max function will return 0, and if x \ge 0, the max function will return x.

For each nonlinear elementary function, we can derive piecewise linear lower and upper bounds over a given interval. We can then convert those bounds to mixed-integer constraints and solve the resulting MILP to overapproximate the reachable set.¹⁵

    15. For more details on the process of deriving the bounds and converting to constraints, see C. Sidrane, A. Maleki, A. Irfan, and M. J. Kochenderfer, "OVERT: An Algorithm for Safety Verification of Neural Network Control Policies for Nonlinear Systems," Journal of Machine Learning Research, vol. 23, no. 117, pp. 1–45, 2022.

9.6 Partitioning

The methods presented in this chapter tend to result in less overapproximation error when computing reachable sets over smaller regions of the input space. For example, Taylor approximations are more accurate for points near the center of the region and become less accurate as we move away from the center (figure 9.15). Therefore, we want to keep the input set for Taylor inclusion functions and Taylor models as small as possible to minimize overapproximation error.
Therefore, we want to keep the input set for Taylor inclusion functions and Taylor
models as small as possible to minimize overapproximation error. 1

Based on this property, we can improve the performance of reachability al-


f (x)

gorithms by partitioning the input set into smaller regions and computing the 0

reachable set for each region separately. Specifically, we divide the input set S
into a set of smaller regions S (1) , S (2) , . . . , S (m) such that 1

m 2 1 0 1 2
[
S= S (i )
(9.30) x
i =1
Figure 9.15. First-order Taylor ap-
(i ) proximation (dashed blue line) for
To compute the reachable set at depth d, we compute the reachable set Rd for the function f ( x) = x sin( x)
each region S (i) separately and then combine the results to form the reachable (gray) at centered x = 0. The ap-
proximation is more accurate near
the center.

Example 9.5. Mixed-integer formulation of the ReLU function.

Suppose we want to solve an optimization problem with the following piecewise linear constraint:

    y = \max(0, x)

We will also assume that we know that x lies in the interval [\underline{x}, \overline{x}]. We can encode this constraint using a set of mixed-integer constraints as follows:

    y \le x - \underline{x}(1 - a)
    y \ge x
    y \le \overline{x} a
    y \ge 0
    a \in \{0, 1\}

The plots below iteratively build up the constrained region for each possible value of a. When a = 0, y must be 0 and x must be between \underline{x} and 0. When a = 1, y must be equal to x and x must be between 0 and \overline{x}.
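A sketch of this encoding with JuMP and the HiGHS solver, using hypothetical bounds \underline{x} = −1 and \overline{x} = 2 and an arbitrary objective:

using JuMP, HiGHS

model = Model(HiGHS.Optimizer)
xlo, xhi = -1.0, 2.0                  # assumed bounds on x
@variable(model, xlo <= x <= xhi)
@variable(model, y)
@variable(model, a, Bin)              # binary activation indicator
@constraint(model, y <= x - xlo * (1 - a))
@constraint(model, y >= x)
@constraint(model, y <= xhi * a)
@constraint(model, y >= 0)
# Any objective over x and y now respects y = max(0, x) exactly.
@objective(model, Max, y - 0.5x)
optimize!(model)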


Example 9.6. Converting a nonlinear equality constraint into a set of mixed-integer constraints using piecewise linear bounds. We use the OVERT.jl package to compute the overapproximations.

Consider the nonlinear constraint y = x − sin(x) over the region −2 ≤ x ≤ 2. We can convert this constraint into a set of piecewise linear constraints by first decomposing the function into its elementary functions:

    y = x - z
    z = \sin(x)
    -2 \le x \le 2

We then derive a piecewise linear lower bound \underline{z} and upper bound \overline{z} for \sin(x) and rewrite the constraints as

    y = x - z
    \underline{z} \le z \le \overline{z}
    \underline{z} = \underline{z}(x)
    \overline{z} = \overline{z}(x)
    -2 \le x \le 2

The plots below show the overapproximations of sin(x) using different numbers of linear segments. (Plots: piecewise linear bounds \underline{z}(x) and \overline{z}(x) of sin(x) with increasing numbers of segments.)

The final step is to convert the piecewise linear functions \underline{z}(x) and \overline{z}(x) into their corresponding mixed-integer constraints. The overapproximations become tighter as the number of segments increases, but the computational cost and the number of mixed-integer constraints required to represent the piecewise linear bounds also increase.


Figure 9.16. Computing the one-step reachable set for the inverted pendulum system using partitioning. The input set S is partitioned into four regions, and the reachable set for each region is computed separately using a first-order Taylor inclusion function. The union of the resulting output sets forms the full reachable set.

Figure 9.17. The effect of the number of subsets m in the partition (m = 1, 4, 16, 100) on the overapproximation error in R6 for the inverted pendulum system. The reachability algorithm applied to each subset uses a first-order Taylor inclusion function. As m increases, the overapproximation error decreases.

set for the entire input set:

    \mathcal{R}_d = \bigcup_{i=1}^{m} \mathcal{R}_d^{(i)}    (9.31)

Figure 9.16 demonstrates this process. The union of the reachable sets for each region is often nonconvex, which is a benefit when performing reachability analysis for nonlinear systems with nonconvex reachable sets.
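A minimal sketch of equations (9.30) and (9.31) using the split function from the LazySets.jl package, where reach stands in for a per-region reachability computation such as algorithm 9.3:

using LazySets

# Partition a hyperrectangular input set into a grid of blocks and
# compute the reachable set for each block separately (equation (9.31)).
function partitioned_reachable(reach, 𝒮::Hyperrectangle, blocks)
    𝒮s = split(𝒮, blocks)              # e.g., blocks = [2, 2] gives m = 4
    return UnionSetArray([reach(𝒮ᵢ) for 𝒮ᵢ in 𝒮s])
end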
As shown in figure 9.17, partitioning the input set can significantly reduce
overapproximation error. Finer partitions tend to result in more accurate reach-
able sets. In general, the performance is highly dependent on the partitioning
strategy. Figure 9.17 uses a uniform partitioning strategy, in which the input set
is divided into m equal-sized regions. However, this strategy may be intractable
for systems with high-dimensional input spaces because the number of subsets


Figure 9.18. Example of a two-layer neural network with two neurons in each layer: s2 = f(W1 s1 + b1) and s3 = f(W2 s2 + b2).

in a uniform partition grows exponentially with the dimension. In such cases, we can use more sophisticated partitioning strategies, such as adaptive partitioning based on samples, to improve the accuracy of the reachable set while keeping the computational cost manageable.¹⁶

    16. M. Everett, G. Habibi, C. Sun, and J. P. How, "Reachability Analysis of Neural Feedback Loops," IEEE Access, vol. 9, pp. 163938–163953, 2021.

9.7 Neural Networks

We can use the techniques discussed in the previous sections to verify properties of neural networks. Neural networks are a class of functions that are widely used in machine learning and could be used to represent the agent, environment, or sensor model. They are composed of a series of layers, each of which applies an affine transformation followed by a nonlinear activation function.¹⁷ Given a set of inputs to a neural network, we are often interested in understanding the possible outputs.¹⁸ For example, we may want to ensure that an aircraft collision avoidance system will always output an alert when other aircraft are nearby.

    17. More details about the structure and training of neural networks are found in appendix D.

    18. This process is sometimes referred to as neural network verification. A detailed overview of neural network verification can be found in C. Liu, T. Arnon, C. Lazarus, C. Strong, C. Barrett, and M. J. Kochenderfer, "Algorithms for Verifying Deep Neural Networks," Foundations and Trends in Optimization, vol. 4, no. 3–4, pp. 244–404, 2021.

Evaluating a neural network is similar to performing a rollout of a system. However, instead of computing s_{t+1} by passing s_t through the sensor, agent, and environment models, we compute it by passing s_t through the tth layer of the neural network. If s_t is the input to layer t, then the output s_{t+1} is computed as

    \mathbf{s}_{t+1} = \mathbf{f}(\mathbf{W}_t \mathbf{s}_t + \mathbf{b}_t)    (9.32)

where \mathbf{W}_t is a matrix of weights, \mathbf{b}_t is a bias vector, and \mathbf{f}(\cdot) is a nonlinear activation function. Common activation functions include ReLU, sigmoid, and hyperbolic tangent. Figure 9.18 shows an example of a two-layer neural network. In this context, we can check properties of the neural network by computing the reachable set of the output layer given an input set.

For piecewise linear activation functions, we can compute the exact reachable set by partitioning the input space into different activation sets and computing


Figure 9.19. Example evaluations of the interval counterparts for three common neural network activation functions: ReLU, sigmoid, and tanh.

Figure 9.20. Computing the overapproximate reachable set of a two-layer neural network using natural inclusion functions. The true reachable set for each layer is shown in blue, and the interval overapproximation is shown in purple.

the reachable set for each subset separately.¹⁹ For example, we can compute exact reachable sets for neural networks with ReLU activation functions (example 9.7). However, the number of subsets grows exponentially with the number of nodes in the network. Therefore, exact reachability analysis is often intractable for large neural networks, so it is common to instead use overapproximation techniques to bound the output set.

    19. W. Xiang, H.-D. Tran, J. A. Rosenfeld, and T. T. Johnson, "Reachable Set Estimation and Safety Verification for Piecewise Linear Systems with Neural Network Controllers," in American Control Conference (ACC), 2018.

Similar to the nonlinear systems discussed earlier, we can use inclusion functions to overapproximate the output set of neural networks. By replacing each activation function with its interval counterpart, we obtain the natural inclusion function for a neural network. Figure 9.19 shows an example evaluation of the interval counterpart for the ReLU function. Evaluating the natural inclusion function for the network on a set of input intervals provides an overapproximation of the possible network outputs (figure 9.20).
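A minimal sketch of the natural inclusion function for a ReLU network using IntervalArithmetic.jl; the weights and biases below are arbitrary placeholders:

using IntervalArithmetic

# Interval counterpart of the ReLU function
relu(x) = interval(max(0, inf(x)), max(0, sup(x)))

# Natural inclusion function following equation (9.32): propagate an
# interval vector through each layer (Wₜ, bₜ) of the network.
function natural_inclusion(layers, s)
    for (W, b) in layers
        s = relu.(W * s + b)
    end
    return s
end

layers = [([1.0 -1.0; 0.5 2.0], [0.0, 0.1]),
          ([0.3 0.8; -1.0 1.0], [0.2, 0.0])]
s1 = [interval(-1, 1), interval(-1, 1)]
s3 = natural_inclusion(layers, s1)  # interval enclosure of the outputs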


Example 9.7. Exact reachability for a two-layer neural network with ReLU activation functions.

Suppose we want to propagate the input set S1 shown below through the first layer of the neural network in figure 9.18. We first apply the linear transformation to obtain the pre-activation region Z2 = W1 S1 + b1. Next, we need to apply the nonlinear ReLU activation function to Z2 to compute S2. We divide Z2 into four subsets for which the ReLU function is linear. Each subset corresponds to a different activation pattern for the first layer. An activation pattern describes which nodes in the layer are active for a given input. A node is considered active if its input is greater than zero.

Each quadrant in Z2 corresponds to a different activation pattern and maps to a different subset in the output set S2. For example, in the first quadrant, both nodes are active, so the ReLU has no effect on the output. In the second quadrant, only the second node is active, so the inputs get mapped to a line. The output set S2 is the union of the four subsets. (Plots: input set, linear transformation, activation patterns, output subsets, and output set.)

To compute the final output set of the neural network in figure 9.18, we would repeat this process for each of the subsets that comprise S2. The final output set will therefore be the union of 16 subsets.


As noted in section 9.2.1, natural inclusion functions tend to be overly conservative. Some techniques apply partitioning strategies to reduce overapproximation error.²⁰ Note that we cannot use Taylor inclusion functions or conservative linearization for ReLU networks because the ReLU function is not differentiable at zero. However, we can use other techniques to create inclusion functions that are tighter than the natural inclusion function.²¹

    20. W. Xiang, H.-D. Tran, and T. T. Johnson, "Output Reachable Set Estimation and Verification for Multilayer Neural Networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 11, pp. 5777–5783, 2018.

    21. H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel, "Efficient Neural Network Robustness Certification with General Activation Functions," Advances in Neural Information Processing Systems (NeurIPS), vol. 31, 2018.

We can also evaluate the support function of the output set of a neural network with d layers by solving the following optimization problem:

    \begin{aligned}
    \underset{\mathbf{s}_d}{\text{minimize}} \quad & \mathbf{d}^\top \mathbf{s}_d \\
    \text{subject to} \quad & \mathbf{s}_1 \in \mathcal{S} \\
    & \mathbf{s}_d = f_n(\mathbf{s}_1)
    \end{aligned}    (9.33)

where f_n(s1) is the neural network function and S is the input set. For ReLU networks, it is possible to write this optimization problem as a MILP by converting each ReLU activation function into its corresponding mixed-integer constraints (see example 9.5). To create the mixed-integer constraints, we need an upper and lower bound on the input to each ReLU. We can either select a sufficiently large bound for all nodes²² or compute specific bounds by evaluating the natural inclusion function.²³ To compute an overapproximation of the output set, we evaluate the support function in multiple directions.

    22. M. Akintunde, A. Lomuscio, L. Maganti, and E. Pirovano, "Reachability Analysis for Neural Agent-Environment Systems," in International Conference on Principles of Knowledge Representation and Reasoning, 2018.

    23. V. Tjeng, K. Y. Xiao, and R. Tedrake, "Evaluating Robustness of Neural Networks with Mixed Integer Programming," in International Conference on Learning Representations (ICLR), 2018.

In addition to evaluating the support function, we can use the MILP formulation to check other properties of the neural network by changing the objective function or adding constraints.²⁴ For example, we can check if the output set intersects with a given avoid set or find the maximum disturbance that causes the network to change its output. In general, neural network verification approaches can be combined with the techniques discussed in this chapter to verify closed-loop properties of systems that contain neural networks.²⁵

    24. C. A. Strong, H. Wu, A. Zeljic, K. D. Julian, G. Katz, C. Barrett, and M. J. Kochenderfer, "Global Optimization of Objective Functions Represented by ReLU Networks," Machine Learning, vol. 112, pp. 3685–3712, 2023.

    25. M. Everett, G. Habibi, C. Sun, and J. P. How, "Reachability Analysis of Neural Feedback Loops," IEEE Access, vol. 9, pp. 163938–163953, 2021.

9.8 Summary

• Reachable sets for nonlinear systems are often nonconvex and difficult to compute exactly.

• We can apply a variety of techniques to overapproximate the reachable sets of nonlinear systems.


• Interval arithmetic allows us to propagate intervals through elementary functions.

• We can use interval arithmetic to create inclusion functions that provide overapproximate output intervals for nonlinear functions.

• Taylor inclusion functions overapproximate nonlinear functions by passing intervals through their Taylor series approximations.

• An nth-order Taylor model represents sets using a Taylor approximation of degree n − 1 and an interval remainder term that bounds the sum of the remaining terms in the Taylor series.

• While Taylor inclusion functions always output hyperrectangular sets, Taylor models represent more expressive reachable sets and tend to produce tighter overapproximations.

• We can sample the support function of the reachable set for nonlinear systems by solving an overapproximate linear program or mixed-integer linear program.

• Because nonlinear reachability methods tend to produce tighter overapproximations on smaller input sets, we can reduce overapproximation error by partitioning the input space into smaller sets and computing the reachable set for each smaller set.

• We can extend some of the techniques outlined in this chapter to analyze the output sets of neural networks.

10 Reachability for Discrete Systems

While the techniques in chapters 8 and 9 focus on reachability for systems with
continuous states, this chapter focuses on reachability for systems with discrete
states. We begin by representing the transitions of a discrete system as a directed
graph. This formulation allows us to use graph search algorithms to perform
reachability analysis. Next, we discuss techniques for probabilistic reachability
analysis, in which we calculate the probability of reaching a particular state or
set of states. We conclude by discussing a method to apply these techniques to
continuous systems by abstracting them into discrete systems.

10.1 Graph Formulation

Directed graphs are a natural way to represent the transitions of a discrete system. A directed graph consists of a set of nodes and a set of directed edges connecting the nodes. For discrete systems, each node represents a state of the system, and each edge represents a transition between states (figure 10.1). We can also associate a probability with each edge to represent the likelihood of the transition occurring.

Figure 10.1. Graph representation of a discrete system with two states, s1 and s2. The graph has a node for each state and an edge originating from each state for each possible transition. Each edge is labeled with the probability of the transition. For example, when we are in s1, we have a 0.8 probability of transitioning from s1 to s2.

Algorithm 10.1 creates a directed graph from a discrete system. For each discrete state, it computes the set of possible next states and their corresponding probabilities. It then adds an edge to the graph for each possible transition. Figure 10.2 shows the graph representation of the grid world system. For systems with large state spaces, it may be inefficient to store the full graph in memory. In these cases, we can represent the graph implicitly using a function that takes in a state and returns its successors and their probabilities.
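A minimal sketch of this implicit representation, assuming a successors function with the same interface as in algorithm 10.1; the breadth-first forward computation described in section 10.2.1 can then run without ever materializing the graph:

# Forward reachability over an implicit graph: successor states are
# queried on demand, and transition probabilities are ignored.
function reachable_implicit(sys, 𝒮₁, h)
    ℛ, frontier = Set(𝒮₁), Set(𝒮₁)
    for d in 2:h
        frontier = Set(s′ for s in frontier for s′ in first(successors(sys, s)))
        setdiff!(frontier, ℛ)
        isempty(frontier) && break  # converged
        union!(ℛ, frontier)
    end
    return ℛ
end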



Algorithm 10.1. Converting a discrete system to a directed weighted graph using an extension of the Graphs.jl package (see appendix E for more details). For each state returned by the states function, the algorithm calls the system-specific successors function to determine the set of possible next states and their corresponding probabilities. It then adds an edge to the graph for each possible transition.

function to_graph(sys)
    𝒮 = states(sys.env)
    g = WeightedGraph(𝒮)
    for s in 𝒮
        𝒮′, ws = successors(sys, s)
        for (s′, w) in zip(𝒮′, ws)
            add_edge!(g, s, s′, w)
        end
    end
    return g
end

Figure 10.2. Graph representation of the grid world system. Each node represents a grid world state, and each edge represents a possible transition between states. Darker edges have higher probabilities associated with them.


10.2 Reachable Sets

To compute reachable sets, we ignore the probabilities associated with the edges of the graph and focus only on its connectivity. The reachable sets are represented as collections of discrete states. We focus on two types of reachability analysis: forward reachability and backward reachability. Forward reachability analysis determines the set of states that can be reached from a given set of initial states within a specified time horizon.¹ Backward reachability analysis determines the set of states from which a given set of target states can be reached within a specified time horizon. Figure 10.3 demonstrates the difference between the two types of reachability analysis, and the rest of this section presents algorithms for each type.

    1. This process is sometimes referred to as bounded model checking.

Figure 10.3. Example of forward and backward reachability on a discrete system with six states. The forward reachability algorithm starts from the initial set S1 and progresses forward through the graph to determine the set of states that can be reached within a specified time horizon. The backward reachability algorithm starts from the target set ST and progresses backward through the graph to determine the set of states from which the target set can be reached within a specified time horizon.

10.2.1 Forward Reachability

To compute the forward reachable set from a set of initial states, we perform a breadth-first search on the graph. We start with the initial set of states S1 = R1 and iteratively add the set of states reachable from the current set of states at the next time step. In other words, Rd is the set of states that can be reached through an edge originating from a state in Rd−1. We repeat this process for a specified time horizon or until convergence. Algorithm 10.2 implements this technique, and figure 10.4 shows the results on the grid world problem.


Algorithm 10.2. Forward reachability for discrete systems. The algorithm first creates the graph representation of the system by calling algorithm 10.1. For each depth d, it computes Rd by finding the set of states reachable from Rd−1 according to the edges in the graph and checks for convergence. The outneighbors function returns all nodes connected to the current node through an outgoing edge. The algorithm returns the union of all sets, which corresponds to R1:h.

struct DiscreteForward <: ReachabilityAlgorithm
    h # time horizon
end

function reachable(alg::DiscreteForward, sys)
    g = to_graph(sys)
    𝒮 = 𝒮₁(sys.env)
    ℛ = 𝒮
    for d in 2:alg.h
        𝒮 = Set(reduce(vcat, [outneighbors(g, s) for s in 𝒮]))
        ℛ == (ℛ ∪ 𝒮) && break
        ℛ = ℛ ∪ 𝒮
    end
    return ℛ
end

Figure 10.4. Forward reachable sets for the grid world system (R1, R5, R10, and converged). Reachable states and their corresponding edges are highlighted in blue. In this example, the reachable set converges after 19 steps and shows that all states are reachable from the initial state.


The reachable set has converged once it no longer changes. If we find that R1:d = R1:d−1, the reachable set has converged, and R1:∞ = R1:d.² We can also check for invariant sets by relaxing this condition. Specifically, if R1:d ⊆ R1:d−1, we can conclude that R1:d is an invariant set and that the system will remain within this set for all future time steps (R1:∞ ⊆ R1:d). Performing this check on discrete sets is straightforward because we can directly compare the states contained in each set.

    2. This condition allows us to perform unbounded model checking, in which the output holds over all possible trajectories.
contained in each set.

10.2.2 Backward Reachability

In contrast with forward reachability, which starts from a set of initial states and progresses forward through the graph, backward reachability starts from a set of target states ST and progresses backward through the graph. The target set is often determined based on a specification for the system. For example, the target set may represent a set of goal states or a set of states that should be avoided. The backward reachable set B1:h represents the set of states from which the target set can be reached within the time horizon h.

Algorithm 10.3 computes backward reachable sets for discrete systems given a reachability specification. It has a structure similar to algorithm 10.2. However, instead of starting with the initial state set, it starts with the target set. It then iteratively computes Bd as the set of states with an edge ending at a state in Bd−1. We can check for convergence and invariance using the same conditions we use for forward reachability. Figure 10.5 shows the results of applying the algorithm to the grid world problem to compute the backward reachable sets from the goal and obstacle states.

10.3 Satisfiability

We can use the forward and backward reachable sets of discrete systems to
determine whether they satisfy a reachability specification (figure 10.6). For
forward reachability, we check whether the target set intersects with the forward
reachable set. For backward reachability, we check whether the initial set intersects
with the backward reachable set. In both cases, these checks require us to compute
the full forward or backward reachable set. This process can be computationally
expensive, especially for systems with large state spaces.
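For discrete systems, these overlap checks reduce to set intersections. A minimal sketch, where ℛ is a forward reachable set, ℬ a backward reachable set from the avoid set ψ.set, and 𝒮₁ the initial state set:

# A reachability specification with avoid set ψ.set is violated if the
# forward reachable set meets the avoid set or, equivalently, if the
# backward reachable set meets the initial states.
forward_safe(ℛ, ψ) = isempty(ℛ ∩ ψ.set)
backward_safe(ℬ, 𝒮₁) = isempty(ℬ ∩ 𝒮₁)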


Algorithm 10.3. Backward reachability for discrete systems. The algorithm first creates the graph representation of the system by calling algorithm 10.1. For each depth d, it computes Bd by finding the set of states from which Bd−1 can be reached according to the edges in the graph and checks for convergence. The inneighbors function returns all nodes connected to the current node through an incoming edge. The algorithm returns the union of all sets, which corresponds to B1:h.

struct DiscreteBackward <: ReachabilityAlgorithm
    h # time horizon
end

function backward_reachable(alg::DiscreteBackward, sys, ψ)
    g = to_graph(sys)
    𝒮 = ψ.set
    ℬ = 𝒮
    for d in 2:alg.h
        𝒮 = Set(reduce(vcat, [inneighbors(g, s) for s in 𝒮]))
        ℬ == (ℬ ∪ 𝒮) && break
        ℬ = ℬ ∪ 𝒮
    end
    return ℬ
end

Figure 10.5. Backward reachable sets for the grid world system (B1, B2, B5, and converged). The top row shows the backward reachable sets from the goal state (green), and the bottom row shows the backward reachable sets from the obstacle state (red). The reachable sets from the goal state converge after 14 steps, while the reachable sets from the obstacle state converge after 11 steps. The results show that the goal can be reached from any state outside the obstacle and that the obstacle can be reached from any state outside the goal.


Figure 10.6. Checking whether a discrete system satisfies a reachability specification using forward (blue) and backward (red) reachable sets. If the forward reachable set overlaps (purple) with the avoid set ST, the system is unsafe. Furthermore, if the backward reachable set overlaps (purple) with the initial set S1, the system is unsafe.

10.3.1 Counterexample Search

If our only goal is to check whether a system satisfies a reachability specification, we can use more efficient techniques that do not require us to compute the full reachable set. For example, we could perform the same breadth-first search we perform in algorithms 10.2 and 10.3 while only storing the states in the current and previous reachable sets, Rd and Rd−1, and performing a check for overlap with the target set at each iteration. This approach tends to be more memory-efficient than storing the full reachable set.
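A minimal sketch of this frontier-only search on the graph representation, using the outneighbors function from algorithm 10.2; avoid is the target set of the specification, and the function returns the depth of the first counterexample or nothing if the search converges safely:

using Graphs

function counterexample_search(g, 𝒮₁, avoid, h)
    visited, frontier = Set(𝒮₁), Set(𝒮₁)
    for d in 1:h
        any(in(avoid), frontier) && return d  # counterexample found
        frontier = Set(s′ for s in frontier for s′ in outneighbors(g, s))
        setdiff!(frontier, visited)
        isempty(frontier) && return nothing   # converged without failure
        union!(visited, frontier)
    end
    return nothing
end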
When the target set is an avoid set, we call this analysis counterexample search because reaching the target set represents a counterexample that proves that the system does not satisfy the specification.³ If the analysis converges without reaching any states in the avoid set, we can conclude that the avoid set is not reachable and the system satisfies the specification. Conversely, if we reach a state in the avoid set, we can terminate the search early and return the counterexample.

    3. The term counterexample is another word for failure that is commonly used in formal verification.

If a counterexample exists, we can save computation by finding it early and terminating the search. In these cases, we may want to use a different graph traversal algorithm such as depth-first search. Depth-first search explores the graph by following a single path to its maximum depth before backtracking.⁴ It therefore allows us to more quickly begin searching over full trajectories, which could be more efficient for finding counterexamples.

    4. We can also perform depth-first search to a fixed depth and increase the depth if no counterexamples are found. This process is known as iterative deepening.

The use of more sophisticated graph search algorithms may further increase efficiency.⁵ Heuristic search algorithms such as the one introduced in section 5.3

    5. The Graphs.jl package in Julia implements a variety of graph search algorithms such as Dijkstra's algorithm, the Floyd–Warshall algorithm, and heuristic search. More details on graph search are provided in S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed. Pearson, 2021.

Figure 10.7. Comparison of breadth-first, depth-first, and heuristic search for finding counterexamples in the grid world system. We use A* search as the heuristic search algorithm, which significantly improves efficiency over breadth-first and depth-first search.

further increase efficiency by using heuristics to prioritize paths that are more likely to lead to a counterexample. In cases where the system satisfies the specification and no counterexample exists, these algorithms have the same computational complexity as breadth-first search. Figure 10.7 compares the performance of breadth-first search, depth-first search, and heuristic search for finding counterexamples in the grid world problem.

10.3.2 Boolean Satisfiability

Graph search algorithms may be inefficient for systems with large state spaces, especially when each state has many neighboring states (example 10.1). In these cases, it may be more efficient to formulate the reachability problem as a Boolean satisfiability (SAT) problem. Solving a SAT problem involves searching for a satisfying assignment of the Boolean variables, or propositions, in a propositional logic formula (see section 3.4.1).⁶

    6. Boolean satisfiability is also sometimes referred to as propositional satisfiability.

In the context of reachability analysis, the Boolean variables in the SAT problem represent the discrete states of the system at each time step. The propositional logic formula encodes the possible initial states, transitions between states, and the failure condition. We can then pass the formula to a variety of different SAT solvers, which use heuristics to efficiently search for a satisfying assignment.⁷ If the SAT solver finds a satisfying assignment, the assignment corresponds to a counterexample, and the system does not satisfy the specification. If the SAT solver does not find a satisfying assignment, we can conclude that the system satisfies the specification.

    7. More information on SAT solvers is provided in A. Biere, Handbook of Satisfiability. IOS Press, 2009, vol. 185. The Satisfiability.jl package provides a Julia interface for many common SAT solvers. E. Soroka, M. J. Kochenderfer, and S. Lall, "Satisfiability.jl: Satisfiability Modulo Theories in Julia," Journal of Open Source Software, vol. 9, no. 100, p. 6757, 2024.


Example 10.1. Demonstration of difficulties that arise when applying graph search algorithms to the wildfire problem.

The wildfire problem is an example of a problem in which graph search is intractable. Consider a wildfire scenario modeled as an n × n grid where each cell is either burning or not burning. At each time step, a burning cell has a nonzero probability of spreading the fire to each of its neighboring cells. A burning cell will also remain burning at the next time step with some probability. This problem has 2^(n²) possible states, and a state with b burning cells has as many as 2^(5b) possible successors. For a 5 × 5 grid, the state space has 2^25 ≈ 3.4 × 10^7 states. For a 10 × 10 grid, that number increases to 2^100 ≈ 1.27 × 10^30 possible states. The example below shows the successors for a state where only the cell in the center is burning.

(Illustration: the 32 successor states of a grid in which only the center cell is burning.)

Even though only one cell is burning, there are still 32 successor states. This number only increases as we increase the number of burning cells. A state with 10 burning cells has as many as 2^50 ≈ 1.13 × 10^15 successors. For most grid sizes, even partially computing and storing the graph for the wildfire problem is intractable. For this reason, we cannot use graph search algorithms for this problem and must turn to other methods such as Boolean satisfiability.


For a system with states represented as Boolean vectors⁸ of length m and a given horizon h, we define s1:h as a set of Boolean variables each of length m representing the states at each time step. The initial state set is encoded as a propositional logic formula I(s1), which returns true if s1 is in the initial state set and false otherwise. Example 10.2 demonstrates how to encode the initial state of the wildfire problem. Next, we define a propositional logic formula for each state transition T(st, st+1) that returns true if st+1 is a successor of st and false otherwise (see example 10.3 for the wildfire problem). The failure condition is the negation of the specification ψ.

    8. SAT solvers require Boolean variables. Satisfiability modulo theories (SMT) solvers extend SAT solvers to continuous variables. More information about SMT is provided in A. Biere, Handbook of Satisfiability. IOS Press, 2009, vol. 185.

Example 10.2. Encoding the initial state of the wildfire problem as a propositional logic formula using the Satisfiability.jl package.

Consider a wildfire problem with a 10 × 10 grid and a time horizon of h = 20. The state at a particular time step is represented as a set of Boolean variables that represent whether each grid cell is burning. The SAT problem will therefore have 100 × 20 = 2000 Boolean variables representing the states at each time step. We can represent the initial state as a propositional logic formula that evaluates to true when the bottom left cell is burning and all other cells are not burning. The following code implements this formula:

n = 10 # grid is n x n
h = 20 # time horizon
@satvariable(burning[1:n, 1:n, 1:h], Bool)
init = burning[1, 1, 1] # bottom left cell is burning
for i in 1:n, j in 1:n
    if i ≠ 1 || j ≠ 1 # all other cells are not burning
        init = init ∧ ¬burning[i, j, 1]
    end
end

Combining the initial state and transition formulas with the failure condition, we can create a single propositional logic formula that represents the reachability problem:

    I(s_1) \wedge (T(s_1, s_2) \wedge T(s_2, s_3) \wedge \cdots \wedge T(s_{h-1}, s_h)) \wedge \neg\psi(s_{1:h})    (10.1)

A SAT solver will search the space of possible values for the Boolean variables s1:h to find an assignment that satisfies the formula. A satisfying assignment corresponds to a feasible trajectory that satisfies the failure condition. Therefore, if the SAT solver determines that there are no satisfying assignments, we can conclude that the system satisfies the specification. Example 10.4 demonstrates how to use Boolean satisfiability to check reachability specifications for the wildfire problem.


Example 10.3. Encoding the transitions of the wildfire problem as a propositional logic formula using the Satisfiability.jl package.

The following code implements the propositional logic formula for the transitions of the wildfire problem:

    transition = true
    for i in 1:n, j in 1:n, t in 1:h-1
        transition = transition ∧ (
            burning[i, j, t+1] ⟹
            (burning[i, j, t] ∨
             burning[max(1, i-1), j, t] ∨
             burning[min(n, i+1), j, t] ∨
             burning[i, max(1, j-1), t] ∨
             burning[i, min(n, j+1), t])
        )
    end

If a particular cell is burning at time t + 1, it must be the case that either it was burning at time t or one of its neighbors was burning at time t. The examples below show two evaluations of the transition proposition.

[figure: two pairs of grid states, one pair where T evaluates to true and one where it evaluates to false]

In the first case, both cells burning at time t + 1 were either burning at time t or had a neighbor that was burning at time t. In the second case, the cell at (3, 4) is burning at time t + 1 even though it was not burning at time t and none of its neighbors were burning at time t.


Example 10.4. Checking reachability specifications for the wildfire problem using Boolean satisfiability.

Suppose there is a densely populated area in the top right cell of the wildfire grid, and we want to determine whether it might burn. We can encode the failure condition as a propositional logic formula that evaluates to true when the top right cell is burning. We can then combine this formula with the initial state and transition formulas from equation (10.1) and pass it to a SAT solver to determine whether the top right cell is reachable. The following code demonstrates this process:

    ψ = ¬burning[n, n, h] # the top right cell is not burning at the horizon
    reachable = sat!(init ∧ transition ∧ ¬ψ)

Because the transition formula allows a burning cell to remain burning, checking the final time step is sufficient to detect reachability within the horizon. For a 10 × 10 grid with a horizon of 20, the burning variable has 10 × 10 × 20 = 2000 Boolean variables and therefore 2^2000 possible assignments. However, the SAT solver can efficiently search this space to find a satisfying assignment in a few seconds. If we decrease the time horizon to 18, the SAT solver is able to determine in a similar amount of time that none of the 2^1800 possible assignments satisfy the formula. This result indicates that the top right cell is not reachable within 18 time steps.
10.4 Probabilistic Reachability

Probabilistic reachability analysis computes the probability of reaching a target set by taking into account the probability of each transition between states.⁹ In some cases, the results allow us to build more confidence in a system than the reachable sets alone. For example, if the avoid set overlaps with the reachable set, our reachability analysis will conclude that the system is unsafe even if the probability of reaching the avoid set is very low. Example 10.5 demonstrates this property on the grid world problem. Probabilistic reachability analysis allows us to uncover these scenarios and provide a more useful safety assessment that focuses on actual risk.

⁹ When we consider the transition probabilities, the system represents a discrete-time Markov chain. More information on the analysis of Markov chains is provided in J. R. Norris, Markov Chains. Cambridge University Press, 1998. Open source software packages such as PRISM (M. Kwiatkowska, G. Norman, and D. Parker, "PRISM 4.0: Verification of Probabilistic Real-Time Systems," in International Conference on Computer Aided Verification, 2011) and STORM (C. Hensel, S. Junges, J.-P. Katoen, T. Quatmann, and M. Volk, "The Probabilistic Model Checker Storm," International Journal on Software Tools for Technology Transfer, pp. 1–22, 2022) implement these analysis techniques.
10.4.1 Probability of Occupancy

Determining the probability of occupancy involves computing a distribution over reachable states at each time step. We denote this distribution as P_t, where P_t(s) is the probability of occupying state s at time step t. The algorithm begins with an


Example 10.5. Comparison of reachable set analysis and probabilistic forward reachability analysis on the grid world problem.

Consider the grid world problem with a slip probability of 0.3. Running algorithm 10.2 with a time horizon h = 9 leads to the conclusion that the system is unsafe because the obstacle is included in the forward reachable set. However, the probability of reaching the obstacle after 9 steps when following the optimal policy is only 0.0004, and the system is more likely to be in a state near its nominal path to the goal. In this scenario, probabilistic reachability provides a more useful assessment of the actual safety of the system. The plots below show the reachable set (left) and the results of a probabilistic reachability analysis (right).

[figure: two grid world plots, the reachable set on the left and the probabilistic reachability results on the right]


initial state distribution P₁. It then computes the distribution at each subsequent time step using the distribution from the previous time step and the transition probabilities between states as follows:

$$P_{t+1}(s) = \sum_{s' \in \mathcal{S}} T(s', s) \, P_t(s') \tag{10.2}$$

where T(s', s) is the probability of transitioning from state s' to state s.
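As a small illustration, equation (10.2) is a matrix-vector product with the transposed transition matrix. Below is a minimal sketch in Julia for a hypothetical three-state chain (start, goal, obstacle) in which the goal and obstacle are absorbing; the occupancy function name and the transition values are our own assumptions, not part of the algorithms in this chapter:

    # Equation (10.2) as repeated multiplication by the transposed transition
    # matrix, where T[i, j] is the probability of moving from state i to state j.
    occupancy(T, P₁, h) = foldl((P, _) -> T' * P, 2:h; init=P₁)

    T = [0.6 0.3 0.1;   # start: stay, reach goal, reach obstacle
         0.0 1.0 0.0;   # goal (absorbing)
         0.0 0.0 1.0]   # obstacle (absorbing)
    P₅₀ = occupancy(T, [1.0, 0.0, 0.0], 50)  # ≈ [0.0, 0.75, 0.25]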


Algorithm 10.4 implements probabilistic forward reachability using the graph representation of the system. The weights in the graph correspond to T(s', s) in equation (10.2). The algorithm also uses the fact that the only nonzero terms in the sum in equation (10.2) are the terms corresponding to the incoming neighbors of s in the graph. The algorithm terminates after a desired time horizon h. Example 10.5 demonstrates this technique on the grid world problem.

Algorithm 10.4. Determining the probability of occupancy for discrete systems. The algorithm first creates the graph representation of the system using algorithm 10.1. It then begins with the initial state distribution and iteratively computes the distribution at each time step using equation (10.2). The inneighbors function returns all nodes in the graph that are connected to the current node through an incoming edge. The algorithm terminates when it has reached the time horizon.

    struct ProbabilisticOccupancy <: ReachabilityAlgorithm
        h # time horizon
    end

    function reachable(alg::ProbabilisticOccupancy, sys)
        𝒮, g, dist = states(sys.env), to_graph(sys), Ps(sys.env)
        P = Dict(s => pdf(dist, s) for s in 𝒮)
        for t in 2:alg.h
            P = Dict(s => sum(get_weight(g, s′, s) * P[s′]
                              for s′ in inneighbors(g, s)) for s in 𝒮)
        end
        return SetCategorical(P)
    end

10.4.2 Finite-Horizon Reachability

In addition to the distribution over states at a particular time step, we may also be interested in the probability that a target state or set of states is reached within a given time horizon. We denote this probability as R_t, where R_t(s) is the probability of reaching the target set 𝒮_T when starting from state s within t time steps. Unlike the output of probabilistic occupancy analysis, R_t is not a distribution over states, and the probability values for all states will not sum to 1.


Example 10.6. Determining occupancy probabilities for the grid world problem.

The plots below show the results from probabilistic occupancy analysis on the grid world problem with a slip probability of 0.3. They show the distribution over reachable states at different time steps, with reachable states appearing larger and darker states indicating a higher probability of reaching them. The nominal path is highlighted in gray.

[figure: occupancy distributions P₅, P₁₀, P₁₅, and P₅₀ over the grid world]

While the obstacle state is reachable in three of the plots, the probability of occupying the obstacle state is low and the probability is much higher for states near the nominal path. After 50 time steps, most of the probability mass is in the goal state with a small portion in the obstacle state and the other grid cells. At this point, the probability of being in the goal state is 0.981 and the probability of being in the obstacle state is 0.018. We can use these numbers to draw conclusions about the overall safety of the system.


Similar to P_t, we can derive a recursive relationship to compute R_t such that

$$R_{t+1}(s) = \begin{cases} 1 & \text{if } s \in \mathcal{S}_T \\ \sum_{s' \in \mathcal{S}} T(s, s') \, R_t(s') & \text{otherwise} \end{cases} \tag{10.3}$$

In other words, for states in the target set, the probability of reaching the target set is 1. For all other states, the probability of reaching the target set within t + 1 time steps is the sum of the probability of transitioning to each of its successors times the probability that they reach the target set within t time steps. We initialize R₁ to be 1 for states in the target set and 0 otherwise.

We can use the results of this analysis to identify dangerous states for the system. Furthermore, if we know the initial state distribution P₁ for the system, we can determine the probability of reaching the target set within a given time horizon by summing the probability of reaching the target set from each state weighted by the probability of occupying that state at time t = 1:

$$P_{\text{reach}} = \sum_{s \in \mathcal{S}} R_h(s) \, P_1(s) \tag{10.4}$$
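A quick numeric sketch of this recursion in Julia (the three-state chain, its values, and the helper name are hypothetical; terminal states are encoded with no outgoing transitions, and the obstacle is the target set):

    function finite_horizon_reach(T, 𝒮T, h)
        R = [s ∈ 𝒮T ? 1.0 : 0.0 for s in axes(T, 1)]          # R₁
        for t in 2:h
            R = [s ∈ 𝒮T ? 1.0 : sum(T[s, s′] * R[s′] for s′ in axes(T, 2))
                 for s in axes(T, 1)]                          # equation (10.3)
        end
        return R
    end

    T = [0.6 0.3 0.1;   # start
         0.0 0.0 0.0;   # goal: terminal, no outgoing transitions
         0.0 0.0 0.0]   # obstacle: terminal, no outgoing transitions
    R = finite_horizon_reach(T, [3], 50)
    P_reach = R[1]      # equation (10.4) with all initial mass on the start state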

Algorithm 10.5 implements finite-horizon probabilistic reachability using the graph representation of the system given a reachability specification, and figure 10.8 demonstrates this technique on the grid world problem.

Algorithm 10.5. Finite-horizon probabilistic reachability for discrete systems. The algorithm first creates the graph representation of the system using algorithm 10.1. It then initializes the probability of reaching the target set from each state and iteratively computes the probability at each time step using equation (10.3). The algorithm terminates when it has reached the time horizon. It returns the probability of reaching the target set in h time steps given the initial state distribution, which is computed according to equation (10.4).

    struct ProbabilisticFiniteHorizon <: ReachabilityAlgorithm
        h # time horizon
    end

    function reachable(alg::ProbabilisticFiniteHorizon, sys, ψ)
        𝒮, g, dist = states(sys.env), to_graph(sys), Ps(sys.env)
        𝒮T = ψ.set
        R = Dict(s => s ∈ 𝒮T ? 1.0 : 0.0 for s in 𝒮)
        for t in 2:alg.h
            R = Dict(s => s ∈ 𝒮T ? 1.0 : sum(get_weight(g, s, s′) * R[s′]
                     for s′ in outneighbors(g, s)) for s in 𝒮)
        end
        return sum(R[s] * pdf(dist, s) for s in 𝒮)
    end


[figure: finite-horizon reachability probabilities R₅, R₁₀, R₁₅, and R₂₀₀ over the grid world, with the goal as the target set (top row) and the obstacle as the target set (bottom row)]

Figure 10.8. Finite-horizon probabilistic reachability for the grid world problem with a slip probability of 0.6. The top row shows the result when we set the target set to the goal state, and the bottom row shows the result when we set the target set to the obstacle state. States with nonzero probability are colored according to the probability of reaching the target set. Given an initial state distribution that places all probability on the state in the bottom left corner, the probability of reaching the goal state within 200 time steps is 0.777, and the probability of reaching the obstacle state is 0.222.

10.4.3 Infinite-Horizon Reachability

If we run finite-horizon reachability analysis over a large horizon, the probability of reaching the target set will begin to converge (see figure 10.9). Therefore, in many scenarios, running the analysis for a sufficiently long time horizon is enough to draw conclusions about the overall safety of the system. However, it is also possible to compute the probability of reaching the target set in the limit as the time horizon approaches infinity. This probability is known as the infinite-horizon reachability probability, and we denote it as R_∞(s).

[figure: P_reach for the goal and the obstacle as a function of the horizon h]

Figure 10.9. Probability of reaching the goal state and the obstacle state in the grid world problem with a slip probability of 0.6 as a function of the time horizon. We assume the system is initialized in the bottom left corner. As the horizon increases, the probabilities begin to converge.

To compute this probability, we rewrite the recursive relationship in equation (10.3) as

$$R_{t+1}(s) = R_1(s) + \sum_{s' \in \mathcal{S}} T_R(s, s') \, R_t(s') \tag{10.5}$$

where

$$T_R(s, s') = \begin{cases} 0 & \text{if } s \in \mathcal{S}_T \\ T(s, s') & \text{otherwise} \end{cases} \tag{10.6}$$

While this formulation is equivalent to equation (10.3), it allows us to compute the infinite-horizon reachability probability by solving a system of linear equations.


We can write this system in matrix form as

$$\mathbf{R}_{t+1} = \mathbf{R}_1 + \mathbf{T}_R \mathbf{R}_t \tag{10.7}$$

where R_t is a vector of length |𝒮| such that the ith entry corresponds to R_t(s_i), and T_R is a matrix of size |𝒮| × |𝒮| such that the entry in the ith row and jth column corresponds to T_R(s_i, s_j).¹⁰

¹⁰ This formulation is equivalent to a Markov reward process with an immediate reward of 1 for all states in the target set and 0 otherwise. The states in the target set are terminal states.

For an infinite horizon, we have that

$$\mathbf{R}_\infty = \mathbf{R}_1 + \mathbf{T}_R \mathbf{R}_\infty \tag{10.8}$$

We can solve for R_∞ by rearranging the terms in equation (10.8) to get

$$\mathbf{R}_\infty - \mathbf{T}_R \mathbf{R}_\infty = \mathbf{R}_1 \tag{10.9}$$

$$(\mathbf{I} - \mathbf{T}_R) \mathbf{R}_\infty = \mathbf{R}_1 \tag{10.10}$$

$$\mathbf{R}_\infty = (\mathbf{I} - \mathbf{T}_R)^{-1} \mathbf{R}_1 \tag{10.11}$$
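To make this concrete, here is a minimal worked instance of equation (10.11) in Julia for the same hypothetical three-state chain (start, goal, obstacle; the obstacle is the target set). Encoding terminal states with no outgoing transitions is an assumption that matters here: a non-target state with a probability-1 self-loop would make I - T_R singular.

    using LinearAlgebra

    T = [0.6 0.3 0.1;   # start: stay with 0.6, goal with 0.3, obstacle with 0.1
         0.0 0.0 0.0;   # goal: terminal, no outgoing transitions
         0.0 0.0 0.0]   # obstacle: terminal, no outgoing transitions

    R₁ = [0.0, 0.0, 1.0]  # 1 for states in the target set (the obstacle)
    T_R = copy(T)
    T_R[3, :] .= 0.0      # equation (10.6): zero out rows for target states

    R∞ = (I - T_R) \ R₁   # equation (10.11); yields [0.25, 0.0, 1.0]

From the start state, the chain exits with probability 0.4 at each step, and a fraction 0.1/0.4 = 0.25 of those exits lead to the obstacle, which matches the solution of the linear system.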

Algorithm 10.6 implements infinite-horizon probabilistic reachability by converting the graph representation of the system to matrix form and solving the system of linear equations in equation (10.11). Example 10.7 shows the results on the grid world problem for different slip probabilities.

Algorithm 10.6. Infinite-horizon probabilistic reachability for discrete systems. The algorithm creates R₁ and T_R from the graph representation of the system. The to_matrix function converts the graph to a transition matrix representation. The transition matrix can be represented as a sparse matrix if memory is constrained. The algorithm uses equation (10.11) to compute the infinite-horizon reachability probability from each state and uses the initial state distribution to compute the overall probability of reaching the target set.

    struct ProbabilisticInfiniteHorizon <: ReachabilityAlgorithm end

    function reachable(alg::ProbabilisticInfiniteHorizon, sys, ψ)
        𝒮, g, dist = states(sys.env), to_graph(sys), Ps(sys.env)
        𝒮Ti = [index(g, s) for s in ψ.set]
        R₁ = [i ∈ 𝒮Ti ? 1.0 : 0.0 for i in eachindex(𝒮)]
        TR = to_matrix(g)
        TR[𝒮Ti, :] .= 0
        R∞ = (I - TR) \ R₁
        return sum(R∞[i] * pdf(dist, state(g, i)) for i in eachindex(𝒮))
    end

10.5 Discrete State Abstractions
The methods discussed in this chapter apply only to discrete systems. However,
we can use them to produce overapproximate reachability results for continuous
systems by creating a discrete state abstraction (DSA). To create a discrete state


abstraction, we partition the continuous state space into a finite number of smaller regions. We then create a graph where the nodes correspond to the regions, and the edges correspond to transitions between regions. Figure 10.10 shows the process of creating a DSA for the inverted pendulum problem.

Example 10.7. Infinite-horizon probability of reaching the obstacle for different slip probabilities in the grid world problem.

Suppose we want to understand the probability of reaching the obstacle state for grid world problems with different slip probabilities. The plots below show the results of infinite-horizon reachability analysis with the obstacle as the target set for slip probabilities of 0.3, 0.5, and 0.7. For each slip probability, we compute P_fail assuming we start in the bottom left corner of the grid.

[figure: three grid world plots with P_fail = 0.018, P_fail = 0.102, and P_fail = 0.49]

As the probability of slipping increases, the probability of reaching the obstacle state also increases, especially for states near the obstacle.

10.5.1 Reachable Sets

To obtain overapproximate reachable sets of continuous systems using a DSA, it is important to ensure that we overapproximate the transitions between regions. In other words, if there exists a state in region 𝒮⁽ⁱ⁾ that can transition to a state in region 𝒮⁽ʲ⁾ in one step, we add an edge between the nodes corresponding to 𝒮⁽ⁱ⁾ and 𝒮⁽ʲ⁾ in the graph. This rule creates an overapproximation since there may be some states in 𝒮⁽ⁱ⁾ that cannot reach 𝒮⁽ʲ⁾ in one step.

For continuous systems with bounded disturbances, we can calculate the reachable set using the algorithms in chapters 8 and 9 to determine the connectivity of the graph. For each region in the partition 𝒮⁽ⁱ⁾ ∈ 𝒮, we use a forward reachability algorithm to compute the exact or overapproximate one-step reachable set ℛ⁽ⁱ⁾. For any region 𝒮⁽ʲ⁾ ∈ 𝒮 that intersects with ℛ⁽ⁱ⁾, we add an edge from


[figure: three panels showing the continuous state space, the partition, and the resulting DSA graph, with pendulum angle θ (rad) on the horizontal axis and angular velocity ω (rad/s) on the vertical axis]

Figure 10.10. Process of creating a discrete state abstraction for the inverted pendulum problem. In this particular example, we partition the continuous state space uniformly into 36 regions. We then create a graph where the nodes correspond to the regions and the edges correspond to the possible transitions between regions.

the node corresponding to 𝒮⁽ⁱ⁾ to the node corresponding to 𝒮⁽ʲ⁾. Example 10.8 implements this process to create a DSA for the inverted pendulum problem using algorithm 10.1.

Once we have the graph representation of the DSA, we can apply algorithms 10.2 and 10.3 to determine its forward and backward reachable sets. We can then use these results to determine overapproximate reachable sets for the continuous system. Specifically, the overapproximate reachable set for the continuous system is the union of all regions that correspond to a reachable node in the DSA. Figure 10.11 shows this process for the inverted pendulum system.

The choice of partition in the DSA affects the amount of overapproximation error in the reachable sets. In general, a finer partition will result in less overapproximation error at the cost of increased computational complexity (figure 10.12). The examples in this chapter use a uniform partitioning strategy, which may be computationally prohibitive for high-dimensional systems. Adaptive partitioning strategies reduce the number of regions while maintaining a desired level of accuracy.¹¹

¹¹ S. M. Katz, K. D. Julian, C. A. Strong, and M. J. Kochenderfer, "Generating Probabilistic Safety Guarantees for Neural Network Controllers," Machine Learning, vol. 112, pp. 2903–2931, 2023.

10.5.2 Probabilistic Reachability

For probabilistic reachability, the edges in the graph representation of the DSA correspond to overapproximate transition probabilities. Specifically, the weight on the edge from region 𝒮⁽ⁱ⁾ to 𝒮⁽ʲ⁾ must be greater than or equal to the probability that any state in 𝒮⁽ⁱ⁾ transitions to a state in 𝒮⁽ʲ⁾. The calculation of these overapproximated probabilities is system specific. Example 10.9 demonstrates this process for a continuum world problem with Gaussian disturbances on its transitions. Given these transition probabilities, we can apply algorithm 10.4 or algorithm 10.5 to determine the overapproximate probabilities of occupying or reaching a set of target states.¹²

¹² Since the transition probabilities are overapproximations, we may calculate intermediate overapproximate probabilities that are greater than 1. In these cases, these probabilities should be clamped to a value of 1.


[figure: reachable sets after 1, 2, and 3 steps and at convergence, for the DSA (top row) and the continuous system (bottom row)]

Figure 10.11. Forward reachability for the inverted pendulum system using a discrete state abstraction. The top row shows the reachable sets (blue) of the DSA, and the bottom row shows the corresponding reachable sets (blue) in the continuous system. The x-axis represents the angle of the pendulum, and the y-axis represents the angular velocity.

10.6 Summary

• We can represent discrete systems as directed graphs where the nodes represent states and the edges represent transitions between states.

• Forward reachable sets for discrete systems can be computed by applying breadth-first search from a set of initial states.

• Backwards reachability algorithms begin with a set of target states and calculate the set of states that can reach the target set in a given time horizon.

• If our only goal is to check whether a system satisfies a reachability specification, we may be able to use more efficient algorithms that do not directly compute reachable sets, such as heuristic search or Boolean satisfiability.


Example 10.8. Creating a DSA for the inverted pendulum system using algorithm 10.1. The plots show the process of determining the connectivity of the graph for a single region 𝒮⁽ⁱ⁾, followed by the graph for the final DSA with a uniform partition of the state space into 64 regions.

We can create a DSA for the inverted pendulum system using algorithm 10.1 by defining the states function to partition the state space into a grid of regions and the successors function to determine the connectivity of the graph using a nonlinear forward reachability technique such as conservative linearization. Example implementations are as follows:

    function states(env::InvertedPendulum; nθ=8, nω=8)
        θs, ωs = range(-1.2, 1.2, length=nθ+1), range(-1.2, 1.2, length=nω+1)
        𝒮 = [Hyperrectangle(low=[θlo, ωlo], high=[θhi, ωhi])
             for (θlo, θhi) in zip(θs[1:end-1], θs[2:end])
             for (ωlo, ωhi) in zip(ωs[1:end-1], ωs[2:end])]
        return 𝒮
    end

    function successors(sys, 𝒮⁽ⁱ⁾)
        _, 𝒳 = sets(sys, 2)
        ℛ⁽ⁱ⁾ = conservative_linearization(sys, 𝒮⁽ⁱ⁾ × 𝒳)
        ℛ⁽ⁱ⁾ = VPolytope([clamp.(v, -1.2, 1.2) for v in vertices_list(ℛ⁽ⁱ⁾)])
        𝒮⁽ʲ⁾s = filter(𝒮⁽ʲ⁾->!isempty(ℛ⁽ⁱ⁾ ∩ 𝒮⁽ʲ⁾), states(sys.env))
        return 𝒮⁽ʲ⁾s, ones(length(𝒮⁽ʲ⁾s))
    end

The plots below demonstrate the successors function on an example region 𝒮⁽ⁱ⁾. The function first computes ℛ⁽ⁱ⁾ using conservative linearization (left). It then determines the regions 𝒮⁽ʲ⁾ that intersect with ℛ⁽ⁱ⁾ (middle). Finally, the function returns these regions so that they can be connected in the graph (right). The edge weights can be ignored when computing reachable sets.

[figure: three panels in the θ–ω plane showing the reachable set ℛ⁽ⁱ⁾ of region 𝒮⁽ⁱ⁾, the regions it intersects, and the resulting edges]

Algorithm 10.1 calls the successors function for each region in the partition to determine the connectivity of the graph. The resulting graph for the final DSA is shown below.

[figure: the final DSA graph over the 64-region partition of the θ–ω plane]


[figure: three panels in the θ–ω plane showing converged reachable sets for increasingly fine DSA resolutions]

Figure 10.12. Converged overapproximate forward reachable sets for the inverted pendulum using different resolutions of the DSA. As the resolution increases, the overapproximation error decreases.

• Probabilistic reachability analysis allows us to compute the probability of reaching a set of target states in a finite or infinite time horizon.

• We can convert continuous systems into discrete systems by producing a discrete state abstraction of the continuous system.

• We can apply the reachability algorithms for discrete systems to a DSA to determine overapproximate reachable sets for its corresponding continuous system.


Example 10.9. Overapproximation of the transition probabilities for the DSA of the continuum world system.

Suppose we have a continuum world problem with Gaussian disturbances on its transitions. For example, if the agent takes the up action, its next position is sampled from a Gaussian distribution with a mean 1 unit above its current state and a standard deviation of 1 in each direction. In other words, T(s, s') = 𝒩(s' | s + d, I), where d is the direction vector corresponding to the action taken in state s. Our goal is to determine the overapproximated transition probabilities T(𝒮, 𝒮') for a DSA of the continuous system.

To obtain the probability of transitioning from a specific state s to a region in the partition 𝒮', we integrate the transition function such that T(s, 𝒮') = ∫_{𝒮'} T(s, s') ds'. To obtain an overapproximation of the transition probabilities, we select the state in the current region 𝒮 that results in the highest probability of reaching the target region 𝒮' such that T(𝒮, 𝒮') = max_{s ∈ 𝒮} T(s, 𝒮'). The plots below demonstrate this process.

[figure: three panels showing T(s, s') for a single state s, T(s, 𝒮') over the regions of the DSA, and the overapproximated T(𝒮, 𝒮')]

The maximization in the formula for T(𝒮, 𝒮') finds the state in 𝒮 that puts the highest amount of probability mass in 𝒮'. The plots below demonstrate this maximization for three different next regions 𝒮'. This process produces an overapproximation of the transition probabilities since we assume all states in 𝒮 transition according to the worst-case state.

[figure: the maximizing state and the resulting probability mass for three different next regions 𝒮']
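Below is a minimal sketch of this computation in Julia, assuming axis-aligned rectangular regions given by lower and upper corner vectors and an identity covariance, so the integral over 𝒮' factorizes into one-dimensional Gaussian CDFs (the function names are hypothetical):

    using Distributions

    # Probability of landing in the region [lo′, hi′] from state s under
    # T(s, s′) = 𝒩(s′ | s + d, I); factorizes across dimensions.
    function T_region(s, d, lo′, hi′)
        μ = s + d
        return prod(cdf(Normal(μ[k], 1.0), hi′[k]) - cdf(Normal(μ[k], 1.0), lo′[k])
                    for k in eachindex(μ))
    end

    # Overapproximate T(𝒮, 𝒮′) by maximizing over s in the region [lo, hi].
    # Each one-dimensional term is maximized when the mean is as close as
    # possible to the midpoint of [lo′, hi′], so the maximizing state can be
    # found by clamping the shifted midpoint into the region.
    function T_region_max(lo, hi, d, lo′, hi′)
        s_worst = clamp.((lo′ .+ hi′) ./ 2 .- d, lo, hi)  # worst-case state in 𝒮
        return T_region(s_worst, d, lo′, hi′)
    end

Because the Gaussian mass in an interval is unimodal in the mean, the clamped midpoint is the exact maximizer in this case; for other disturbance models, the maximization may require a grid search or an analytical bound.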

11 Runtime Monitoring

11.1 Coming Soon


12 Explainability

This chapter focuses on understanding system behavior through explanations. An


explanation is a description of a system’s behavior that helps a human understand
why it behaves in a particular manner. In this chapter, we discuss several types of
explanations. We begin by discussing policy visualization techniques that allow
us to interpret an agent’s policy. We then discuss feature importance techniques
to help us understand the input features that are most important to the behav-
ior of a system. For systems with complex policies, we discuss ways to create
interpretable surrogate models that approximate the system’s behavior. We also
discuss techniques for generating counterfactual explanations that change the
outcome of a particular scenario by making small changes to important features.
We conclude by introducing methods to categorize the failure modes of a system.

12.1 Explanations

We often desire explanations of system behavior when metrics such as failure


probabilities or reachable sets are insufficient for an adequate understanding of
the system. For example, it may be impossible to capture all possible edge-case
behaviors when creating a model of a complex system, which may cause us to miss
potential failure modes. We also may not be able to fully specify our objectives
for a system using metrics or logical specifications. This incompleteness may lead
to an alignment problem (section 1.1) in which the metrics and specifications
used to evaluate a system do not perfectly capture our true objectives. In this case,
explanations of system behavior may provide a better understanding of whether
the behavior is aligned with our objectives. The results can be used to debug the
system by informing changes to the policy, model, or specifications.

[figure: trajectory rollouts for the collision avoidance system (altitude h in meters versus time in seconds) and the inverted pendulum system (angle θ in radians versus time in seconds)]

Figure 12.1. Visualization of the policies for the collision avoidance and inverted pendulum systems by plotting trajectory rollouts. We plot time on the horizontal axis and one of the state variables on the vertical axis.

Explanations are also important to the stakeholders of a system. They can be used to calibrate trust of end users by providing insight into a system's decision-making process. This insight helps users understand the strengths of a system as well as its potential weaknesses. Explanations can also help stakeholders check the fairness of a system by identifying the factors that influence its decisions. Moreover, the end users of high-stakes decision-making systems such as loan approval systems often have a right to an explanation. The General Data Protection Regulation (GDPR) in the European Union requires that users be provided with an explanation of automated decisions that significantly affect them.¹

¹ B. Goodman and S. Flaxman, "European Union Regulations on Algorithmic Decision-Making and a 'Right to Explanation'," AI Magazine, vol. 38, no. 3, pp. 50–57, 2017.

The algorithms presented in this chapter provide descriptions of system behavior to human operators or stakeholders. A good description should be interpretable to humans in a way that allows them to explain and predict the system's behavior.² The interpretability of a description is the degree to which it can be readily parsed by humans. For example, a small decision tree with only a few nodes tends to be more interpretable than a large decision tree with hundreds of nodes. The explainability of a description is the degree to which it helps humans understand why a system behaves in a particular way.³

² J. Colin, T. Fel, R. Cadène, and T. Serre, "What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods," Advances in Neural Information Processing Systems (NeurIPS), pp. 2832–2845, 2022.

³ These definitions are often used interchangeably depending on the context.

12.2 Policy Visualization

One way to understand the behavior of an agent is to visualize its policy in a way that is readily interpretable by humans. For example, we might visualize how the policy affects the state of the system over time. We can generate these plots


by performing rollouts of the policy and plotting the state of the system at each time step. This visualization can help us understand how the policy behaves in different scenarios and identify potential failure modes. Figure 12.1 shows an example of this visualization technique for the aircraft collision avoidance and inverted pendulum policies.

If the policy is Markov and therefore depends only on the current state, we can also visualize it directly by plotting the action taken by the agent in each state. If the state space is two-dimensional as in the inverted pendulum example, we can plot the action taken by the agent as a two-dimensional heatmap (figure 12.2). For higher-dimensional state spaces, we often need to apply dimensionality reduction techniques to visualize the policy. One common technique is to fix all but two of the state variables, which become associated with the vertical and horizontal axes. We can indicate the action for every state with a color. Example 12.1 demonstrates this technique for the collision avoidance policy.

[figure: heatmap over θ (rad) and ω (rad/s)]

Figure 12.2. Visualization of the actions taken by the inverted pendulum policy. The colors represent the torque applied in each state.

Instead of fixing the hidden state variables, we could also use various techniques to aggregate over them (figure 12.3). One method involves partitioning the state space into a set of regions and keeping track of the actions taken in each region over a series of rollouts. We can then aggregate over these actions by plotting the mean or mode of the actions taken in each region, as shown in the sketch below. One benefit of this technique is that it relies only on rollouts of the policy and therefore extends to non-Markovian policies. Because all states may not be reachable in practice, some areas of the policy plot may have no data associated with them.
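A minimal sketch of the aggregation approach in Julia follows. It assumes a two-dimensional state, a zero-argument rollout(sys) that samples a nominal trajectory, step fields step.s (state) and step.a (action), and caller-supplied bin edges; all of these interface details are assumptions for illustration:

    using StatsBase: mode

    # Collect the actions observed in each region of a 2D state-space grid
    # over many rollouts, then summarize each visited region with the mode.
    function aggregate_policy(sys, bins₁, bins₂; n_rollouts=10_000)
        actions = Dict{Tuple{Int,Int}, Vector{Any}}()
        for _ in 1:n_rollouts
            for step in rollout(sys)
                idx = (searchsortedlast(bins₁, step.s[1]),
                       searchsortedlast(bins₂, step.s[2]))
                push!(get!(actions, idx, Any[]), step.a)
            end
        end
        # Regions that were never visited simply have no entry (no data).
        return Dict(idx => mode(as) for (idx, as) in actions)
    end

Replacing mode with mean recovers the mean-aggregation variant shown in figure 12.3.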

12.3 Feature Importance

Feature importance algorithms allow us to understand the contribution of various


input features to the overall behavior of a system. We can use this analysis, for
example, to identify the features of an observation that are most important to
the agent’s decision or to identify the disturbances in a trajectory that have the
greatest effect on its outcome. In this section, we use the term feature to refer to
any component of a system trajectory that might affect the outcome. Features
could include the states, observations, actions, or disturbances of the system. We
can also derive more complex features by combining these basic features. For
example, we could create features for an aircraft collision avoidance system by
grouping states together into different configurations that represent different
relative positions of our aircraft and the intruder.


Example 12.1. Aircraft collision avoidance policy when the relative vertical rate is fixed at 0 m/s (left) and 4 m/s (right), and the previous action is fixed at no advisory. The colors represent the action taken by the agent in each state.

The plots below show the aircraft collision avoidance policy when the relative vertical rate is fixed at 0 m/s and 4 m/s, and the previous action is fixed at no advisory. The red aircraft represents the relative location of the intruder aircraft. We can use these plots to explain the behavior of the policy in these scenarios. For example, we can see that when the relative vertical rate is fixed at zero, the policy advises our aircraft to climb when it is above the intruder and descend when it is below the intruder. This behavior is aligned with our objective of avoiding collisions.

The plot on the left also reveals some potentially unexpected behaviors. For example, when the time to collision is near zero and a collision is imminent, the policy results in no advisory. This behavior may prompt us to perform further analysis. For example, a counterfactual analysis (see section 12.5) reveals that a collision is inevitable in this scenario regardless of the action taken by the agent due to limits on the vertical rate of the aircraft.

[figure: two policy heatmaps over time to collision t_col (s) and relative altitude h (m), one per relative vertical rate, with colors for no advisory, descend, and climb]

Figure 12.3. Result of different aggregation methods for plotting the four-dimensional collision avoidance policy using data from 10,000 rollouts. On the left, we plot the mean of the vertical rate action taken by the agent in each state. On the right, we plot the action taken most frequently by the agent in each state.

[figure: mean (left) and mode (right) aggregation of the collision avoidance policy over t_col (s) and h (m), including regions with no data]


The results of a feature importance analysis can lead to explanations of system behavior that allow us to check fairness and calibrate trust. For example, we could use feature importance to ensure that a loan approval system is not focusing on protected characteristics such as race or gender when making decisions. We could also use feature importance to calibrate trust by ensuring that the agent is focusing on the features required to make a decision. For instance, we could check that an image classifier is not focusing on irrelevant portions of the image when making decisions.⁴ We can also use feature importance to inform the design of future systems by identifying the features that are most important in causing the system to fail.

⁴ We want to ensure that the classifier is not exploiting spurious correlations between the input and output. Neural networks are prone to this type of behavior. R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, "Shortcut Learning in Deep Neural Networks," Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020.

There are a variety of ways to define the importance of a particular feature. One definition involves holding all features other than the feature of interest constant and observing the effect on the system's behavior when varying that feature. Techniques that use this definition are often referred to as sensitivity analysis techniques.⁵ This definition, however, focuses only on the effect of the feature of interest by itself and does not consider its interaction with other features. Another definition of feature importance involves examining the effect of the feature of interest in the context of the other features. Example 12.2 provides a scenario where considering the interactions between features produces a different result. This section presents techniques for determining feature importance using both definitions.

⁵ We could also imagine corrupting all features except the feature of interest and observing the effect on the system's behavior. This technique is sometimes referred to as causal mediation analysis. J. Pearl, "Direct and Indirect Effects," in Conference on Uncertainty in Artificial Intelligence (UAI), 2001.

12.3.1 Sensitivity Analysis

Sensitivity analysis techniques allow us to understand how a particular output changes when a single feature is changed. Examples of possible outputs include the robustness of a trajectory or the agent's decision at a single time step. If we change the value of an input that has high sensitivity, we expect a large change in the output, while if we change an input with low sensitivity, we expect a small change in the output. Figure 12.4 illustrates this concept.

[figure: low-sensitivity (left) and high-sensitivity (right) trajectory bundles]

Figure 12.4. The trajectories show the effect of randomly changing the disturbance at a single time step on the rest of the trajectory (blue). The system on the right has a higher sensitivity to the disturbance applied at the time step than the system on the left, resulting in a wider variety of outcomes.

One way to approximate sensitivity is to randomly perturb a single feature and observe the effect on the system's behavior. By repeating this process multiple times and taking the standard deviation or some other variability metric of the outcome of each trial, we obtain a measure of sensitivity. To evaluate the sensitivity of a decision at a single point in time, we often perturb one component of the observation and observe the effect on the decision (example 12.3). To evaluate the sensitivity of a trajectory, we can perturb the disturbance, observation, or action at one time step and measure the effect on the rest of the trajectory (example 12.4).
variety of outcomes.


Example 12.2. Motivation for considering the interactions between features when determining feature importance.

Consider a wildfire scenario modeled as a grid where each cell is either burning or not burning. At each time step, there is a 30% chance that a cell that was not burning at the previous time step will be burning if at least one of its neighbors was burning. The plots below show an example of a current state s_t and the probability that each cell is burning at the next time step p(s_{t+1}) (darker cells indicate higher probability). Suppose we are interested in understanding the features that are most important in determining the probability that the cell in the upper right corner will burn.

[figure: the current state s_t and the next-step burn probabilities p(s_{t+1}), with the upper right cell marked]

For this example, we will focus specifically on the feature that indicates whether the cell directly to the left of the upper right cell is burning. We can test the first definition of feature importance by changing that cell to not burning while holding all other cells constant and observing the effect on the probability that the upper right cell will burn. In this case (leftmost plots), the probability that the upper right cell will be burning at the next time step does not change. Therefore, we will conclude that this cell has no contribution to the output. However, if we remove fire from both this cell and the cell below the upper right cell (rightmost plots), the upper right cell changes to zero probability of burning at the next time step. The second definition of feature importance considers the interaction between these two features and would conclude that the cell does contribute to the output.

[figure: modified states s'_t and s''_t with their next-step burn probabilities p(s'_{t+1}) and p(s''_{t+1})]


Example 12.3. Sensitivity analysis at a single time step. Brighter pixels in the sensitivity map indicate pixels with higher sensitivity.

Suppose we have an agent that selects a steering angle for an aircraft based on runway images from a camera mounted on its wing. Given a particular input image, we can generate a sensitivity map to identify the pixels that are most important in determining the steering angle by fixing all but the pixel of interest and checking its effect on the steering angle output. The results are shown below, where the left image is the original image and the right image is the sensitivity map. This analysis indicates that the agent is focusing on the portion of the runway in front of it where the lines are most visible.

[figure: the original runway image (left) and its sensitivity map (right)]

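As a concrete illustration of the single-decision case from example 12.3, here is a minimal sketch in Julia. It assumes a hypothetical policy π that maps an observation vector to a scalar action and measures variability under Gaussian perturbations of one component at a time:

    using Statistics

    # Estimate a sensitivity map for a single decision by perturbing one
    # observation component at a time and measuring the spread in the action.
    function sensitivity_map(π, o; m=50, σ=0.1)
        a₀ = π(o)
        sensitivities = zeros(length(o))
        for i in eachindex(o)
            as = map(1:m) do _
                o′ = copy(o)
                o′[i] += σ * randn()  # perturb only component i
                π(o′)
            end
            sensitivities[i] = std(abs.(as .- a₀))
        end
        return sensitivities
    end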
Algorithm 12.1 estimates the sensitivity of the robustness of a particular trajectory with respect to the disturbance at each time step. For each time step in the trajectory, the algorithm perturbs the disturbance m times and computes the robustness of the resulting trajectories. The algorithm then returns the standard deviation of the change in robustness for each time step. The sensitivity of the robustness of a trajectory with respect to its disturbances can be used to identify the disturbances that have the greatest effect on the outcome of the trajectory.

Because algorithm 12.1 requires performing multiple rollouts for each time step, it tends to be inefficient for high-dimensional systems with many disturbances and long time horizons. If the output of interest is differentiable with respect to the input features, we can reduce computational cost by calculating the sensitivity using saliency maps. Saliency maps are a type of sensitivity map that use gradients to identify inputs that are most important, or salient, in determining a particular outcome. We can apply saliency maps to measure the sensitivity of both individual decisions and the outcomes of full trajectories.


Example 12.4. Sensitivity analysis over a full trajectory. Brighter colors in the sensitivity map of the inverted pendulum trajectory indicate higher sensitivity. The black line shows the true angle of the pendulum at each time step, and the colored markers indicate the noisy observation of the current angle at each time step.

We can use sensitivity analysis to understand the effect of disturbances on the outcome of a trajectory. For example, consider an inverted pendulum system in which the agent's observation of its current angle is subject to a noise disturbance. We can estimate the sensitivity of the robustness of a trajectory with respect to its disturbances by perturbing the disturbances at each time step and observing the effect on the robustness of the trajectory. The results on a given failure trajectory are shown below. This analysis indicates that small changes in the disturbances at the beginning of the trajectory have a large effect on the robustness of the trajectory. Furthermore, the disturbances applied towards the end of the failure trajectory have little to no effect because the controller is saturated and the system cannot recover.

[figure: pendulum angle θ (rad) versus time (s) for the failure trajectory, with observations colored by sensitivity]

Algorithm 12.1. Algorithm for estimating the sensitivity of the robustness of a trajectory with respect to its disturbances. It takes as input a vector of trajectory features for the current trajectory we want to evaluate. These features can be converted to a trajectory using the system-specific extract function. The perturb function generates a new trajectory feature vector by perturbing the feature at a particular time step. The algorithm then computes the robustness of the perturbed trajectories and returns the standard deviation of the resulting change in robustness for each time step.

    struct Sensitivity
        x       # vector of trajectory inputs (s, 𝐱 = extract(sys.env, x))
        perturb # x′ = perturb(x, t)
        m       # number of samples per time step
    end

    function describe(alg::Sensitivity, sys, ψ)
        m, x, perturb = alg.m, alg.x, alg.perturb
        s, 𝐱 = extract(sys.env, x)
        τ = rollout(sys, s, 𝐱)
        ρ₀ = robustness([step.s for step in τ], ψ.formula)
        sensitivities = zeros(length(τ))
        for t in eachindex(τ)
            x′s = [perturb(x, t) for i in 1:m]
            τ′s = [rollout(sys, extract(sys.env, x′)...) for x′ in x′s]
            ρs = [robustness([st.s for st in τ′], ψ.formula) for τ′ in τ′s]
            sensitivities[t] = std(abs.(ρs .- ρ₀))
        end
        return sensitivities
    end
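As a usage sketch, assuming a system sys, specification ψ, and trajectory features x from context, and a hypothetical perturbation that adds Gaussian noise to a scalar disturbance feature at time step t:

    perturb(x, t) = (x′ = copy(x); x′[t] += 0.1 * randn(); x′)
    alg = Sensitivity(x, perturb, 100)   # 100 samples per time step
    sensitivities = describe(alg, sys, ψ)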


Algorithm 12.2. Algorithm for approximating the sensitivity of the robustness of a trajectory with respect to its disturbances using gradients. It takes as input a vector of trajectory features for the current trajectory we want to evaluate. These features can be converted to a trajectory using the system-specific extract function. The algorithm computes the robustness of the trajectory and returns the gradient of the robustness with respect to the input features.

    struct GradientSensitivity
        x # vector of trajectory inputs (s, 𝐱 = extract(sys.env, x))
    end

    function describe(alg::GradientSensitivity, sys, ψ)
        function current_robustness(x)
            s, 𝐱 = extract(sys.env, x)
            τ = rollout(sys, s, 𝐱)
            return robustness([step.s for step in τ], ψ.formula)
        end
        return ForwardDiff.gradient(current_robustness, alg.x)
    end

[figure: two panels of pendulum angle θ (rad) versus time (s), one colored by the sensitivity estimate and one by the gradient magnitude]

Figure 12.5. Sensitivity of the robustness of a trajectory with respect to its disturbances for an inverted pendulum system calculated using algorithm 12.2 compared to using algorithm 12.1. The black line shows the true angle of the pendulum at each time step, and the colored markers indicate the noisy observation of the current angle at each time step. Brighter colors indicate higher sensitivity. The gradient calculation provides values similar to the sensitivity estimate from algorithm 12.1.


A simple way to produce a saliency map given a set of inputs is to take the gradient of the output of interest with respect to the inputs.⁶ The saliency of a particular input is related to the magnitude of the gradient at that input. A high gradient magnitude indicates that small changes in the input will result in large changes in the output. In other words, inputs with high gradient values are more salient and indicate higher sensitivity. This method is often used to determine the components of an observation (such as the pixels of an image) that contribute most to an agent's decision.⁷ We can also use it to approximate sensitivity over a full trajectory by taking the gradient of a performance measure with respect to input features such as actions or disturbances. Algorithm 12.2 measures the sensitivity of the robustness of a trajectory with respect to its disturbances, and figure 12.5 shows an example on the inverted pendulum system.

⁶ D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, "How to Explain Individual Classification Decisions," Journal of Machine Learning Research, vol. 11, pp. 1803–1831, 2010.

⁷ K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps," in International Conference on Learning Representations (ICLR), 2014.

While algorithm 12.2 is more computationally efficient than algorithm 12.1, it is limited by its local nature. Important input features, for example, often saturate the output function of interest, causing the gradient to be small even when the feature is important.⁸ The integrated gradients⁹ algorithm addresses this limitation by averaging the gradient along the path between a baseline input and the input of interest (figure 12.6). The choice of baseline depends on the context. For images, a common choice is a black image (figure 12.7). For disturbances, we can set all disturbances to zero.

⁸ For image inputs in particular, it has also been shown that there are sometimes meaningless local variations in gradients that can lead to noisy sensitivity maps. D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, "Smoothgrad: Removing Noise by Adding Noise," in International Conference on Machine Learning (ICML), 2017.

⁹ M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic Attribution for Deep Networks," in International Conference on Machine Learning (ICML), 2017.

Algorithm 12.3 calculates the sensitivity of the robustness of a trajectory with respect to the disturbances at each time step using integrated gradients. It takes m steps along the path between the baseline and the current input and computes the gradient of the robustness at each step.

[figure: an output function f(x) with a saturated gradient at the current value of x (left) and gradients averaged along the path from a baseline to the current value (right)]

Figure 12.6. Example of the integrated gradients algorithm for an input feature x. While x has a significant effect on the output function f(x), the gradient at the current value of x is small because f(x) is saturated (left). If we average the gradient of the output function f(x) along the path between a baseline input and the current value of the input of interest, we can capture this effect (right). Brighter colors of the gradient lines indicate higher magnitudes.


[figure: the original runway image and the gradient values along the path from a black baseline image to the original image]

Figure 12.7. Comparison of the gradients of the output of an aircraft taxi network that selects a steering angle based on runway images from a camera mounted on its wing. As α moves from 0 to 1, the image moves from a baseline black image to the original image. The gradients are much higher for the pixel marked in red, indicating that it has a larger effect on the output of the network. However, if we only computed the gradient of the original image, the effect would appear similar to the effect of the pixel marked in blue.
Algorithm 12.3. Algorithm for approximating the sensitivity of the robustness of a trajectory with respect to its disturbances using integrated gradients. It takes as input a vector of trajectory features for the current trajectory we want to evaluate, a vector of baseline features, and the number of steps for numerical integration. For each step along the path between the baseline and the current input, the algorithm computes the gradient of the robustness. It then returns the average gradient at each time step.

    struct IntegratedGradients
        x # vector of trajectory inputs (s, 𝐱 = extract(sys.env, x))
        b # vector of baseline inputs
        m # number of steps for numerical integration
    end

    function describe(alg::IntegratedGradients, sys, ψ)
        function current_robustness(x)
            s, 𝐱 = extract(sys.env, x)
            τ = rollout(sys, s, 𝐱)
            return robustness([step.s for step in τ], ψ.formula)
        end
        αs = range(0, stop=1, length=alg.m)
        xs = [(1 - α) * alg.b .+ α * alg.x for α in αs]
        grads = [ForwardDiff.gradient(current_robustness, x) for x in xs]
        return mean(hcat(grads...), dims=2)
    end

[figure: the original runway image with its sensitivity, gradient, and integrated gradients maps]

Figure 12.8. Comparison of the sensitivity descriptions using algorithms 12.1 to 12.3 for an aircraft taxi system that selects a steering angle from an image observation. The sensitivity map focuses on the portion where the edge and center lines are most apparent, while the gradient-based methods focus only on the edges of the runway. The integrated gradients method provides a smoother map than the single gradient approach.

The algorithm then returns the average gradient at each time step. As m approaches infinity, the average gradient approaches the integral of the gradient along the path. Figure 12.8 compares the sensitivity estimates from algorithms 12.1 to 12.3 for an aircraft taxi system. All three methods produce slightly different descriptions of the agent's behavior, and in general, the most appropriate sensitivity estimate is application dependent.

12.3.2 Shapley Values


Computing the Shapley value of a feature allows us to evaluate its importance in the context of its interaction with other features. For example, it may be a combination of multiple disturbances that leads to a failure rather than a single disturbance. While sensitivity analysis techniques miss this interaction because they only vary one feature at a time, Shapley values capture it by varying all possible subsets of features.10

10 Shapley values were originally developed in the context of game theory in economics and are named for American mathematician and economist Lloyd Shapley (1923–2016). L. S. Shapley, "Notes on the N-Person Game—II: The Value of an N-Person Game," 1951.

Suppose we have a set of feature indices I = {1, ..., n}, and let I_s ⊆ I be a subset of these features. Given a set of values for the features x and a function f that maps these values to an outcome, we define the following function to represent the expectation of the outcome while holding the features in I_s constant:

    f_{\mathcal{I}_s}(\mathbf{x}) = \mathbb{E}\left[f(\mathbf{x}') \mid x_i' = x_i,\; i \in \mathcal{I}_s\right] \tag{12.1}

The Shapley value φ_i of feature i is then defined as the average marginal contribution of feature i to the expectation of the outcome over all possible subsets of features:

    \phi_i(\mathbf{x}) = \sum_{\mathcal{I}_s \subseteq \mathcal{I} \setminus \{i\}} \frac{|\mathcal{I}_s|!\,(n - |\mathcal{I}_s| - 1)!}{n!} \left(f_{\mathcal{I}_s \cup \{i\}}(\mathbf{x}) - f_{\mathcal{I}_s}(\mathbf{x})\right) \tag{12.2}

Intuitively, computing the Shapley value of feature i involves looping over all possible subsets of features that do not include i and computing the difference in the expectation of the outcome when adding i to the subset. The constant factor in equation (12.2) ensures that subsets of different sizes are weighted equally. In general, Shapley values are expensive (often intractable) to compute due to the large number of possible subsets. For example, a function with 100 input features has 6.3 × 10^29 possible subsets.
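To make equation (12.2) concrete, the following is a minimal sketch of the exact computation for a small number of features. It assumes a hypothetical function f_cond(Is, x) that estimates the conditional expectation in equation (12.1); this helper is not part of the code in this chapter:

using Combinatorics

function shapley_exact(f_cond, x, i)
    n = length(x)
    ϕ = 0.0
    for Is in powerset(setdiff(1:n, [i]))  # all subsets that exclude feature i
        w = factorial(length(Is)) * factorial(n - length(Is) - 1) / factorial(n)
        ϕ += w * (f_cond(vcat(Is, i), x) - f_cond(Is, x))
    end
    return ϕ
end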


We can approximate the Shapley value using randomly sampled subsets.11 First, we rewrite equation (12.2) as follows:

    \phi_i(\mathbf{x}) = \frac{1}{n!} \sum_{\mathcal{P} \in \pi(n)} \left[f_{\mathcal{P}_{1:j}}(\mathbf{x}) - f_{\mathcal{P}_{1:j-1}}(\mathbf{x})\right] \tag{12.3}

where π(n) represents the set of all possible permutations of n elements, j is the index in the permutation P that corresponds to feature i, and P_{1:j} represents the first j elements of P. We can then approximate the Shapley value using sampling. For each sample, we randomly permute the features and compute the difference in the expectation of the outcome when adding feature i to the features before it in the permutation.

11 E. Štrumbelj and I. Kononenko, "Explaining Prediction Models and Individual Predictions with Feature Contributions," Knowledge and Information Systems, vol. 41, pp. 647–665, 2014.
Algorithm 12.4 estimates the Shapley values for the disturbances in a trajectory to determine their contribution to the robustness of the trajectory. It takes in a current trajectory τ with disturbance trajectory x and a number of samples per time step m. For each time step in the trajectory, the algorithm randomly samples another disturbance trajectory w by performing a rollout using the nominal trajectory distribution.12 It then samples a random permutation P of the time steps and performs a rollout in which the disturbances are taken from x for the time steps in P_{1:j} and from w for all other time steps. It similarly performs a rollout in which the disturbances are taken from x for the time steps in P_{1:j-1} and from w for all other time steps. The algorithm then computes the difference in the robustness of the two rollouts and averages the differences over m sampled permutations to estimate the Shapley value of each disturbance.

12 This step requires that the disturbances sampled at each time step are independent of one another. This assumption may break if the disturbances depend on the states, actions, or observations.
Figure 12.9 shows the Shapley values for the disturbances of the inverted
pendulum trajectory used in example 12.3 and figure 12.5. The Shapley values
differ from the sensitivity estimates because they account for interactions between
disturbances. If we remove groups of disturbances with high Shapley values, it
produces a large change in the outcome.

12.4 Policy Explanation through Surrogate Models

For agents with complex policies, it may be difficult to understand the reasoning
behind their decisions. In such cases, we can build surrogate models to approximate
the policy with a model that is easier to interpret. A good surrogate model should
have the following characteristics:


Algorithm 12.4. Estimating the Shapley values of the disturbances in a trajectory. The algorithm takes as input a trajectory τ and a number of samples per time step m. For each time step in the trajectory, the algorithm samples a random vector of disturbances by performing a rollout using the nominal trajectory distribution. It then samples a random permutation of the time steps and locates the current time step in the permutation. Using the shapley_rollout function, the algorithm computes the difference in the robustness of the trajectory when adding the disturbance at the current time step to the subset of disturbances in the permutation. It then averages the differences over m sampled permutations to estimate the Shapley value of each disturbance.

struct Shapley
    τ # current trajectory
    m # number of samples per time step
end

function shapley_rollout(sys, s, 𝐱, 𝐰, inds)
    τ = []
    for t in 1:length(𝐱)
        x = t ∈ inds ? 𝐱[t] : 𝐰[t]
        o, a, s′ = step(sys, s, x)
        push!(τ, (; s, o, a, x))
        s = s′
    end
    return τ
end

function describe(alg::Shapley, sys, ψ)
    τ, m = alg.τ, alg.m
    p = NominalTrajectoryDistribution(sys, length(alg.τ))
    𝐱 = [step.x for step in τ]
    ϕs = zeros(length(τ))
    for t in eachindex(τ)
        for _ in 1:m
            𝐰 = [step.x for step in rollout(sys, p)]
            𝒫 = randperm(length(τ))
            j = findfirst(𝒫 .== t)
            τ₊ = shapley_rollout(sys, τ[1].s, 𝐱, 𝐰, 𝒫[1:j])
            τ₋ = shapley_rollout(sys, τ[1].s, 𝐱, 𝐰, 𝒫[1:j-1])
            ϕs[t] += robustness([step.s for step in τ₊], ψ.formula) -
                     robustness([step.s for step in τ₋], ψ.formula)
        end
        ϕs[t] /= m
    end
    return ϕs
end
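A brief usage sketch, where τ is assumed to be a failure trajectory produced by one of the falsification algorithms from earlier chapters and the number of sampled permutations per time step is an illustrative choice:

alg = Shapley(τ, 10)                   # 10 sampled permutations per time step
ϕs = describe(alg, sys, ψ)             # one Shapley estimate per time step
ranked = sortperm(abs.(ϕs), rev=true)  # time steps ranked by magnitude of contribution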


Figure 12.9 (three plots of θ in rad versus time in s: Shapley Values; Important Features; Removing Important Features). Shapley values for the disturbances in an inverted pendulum failure trajectory. The black line shows the true angle of the pendulum at each time step, and the colored markers indicate the noisy observation of the current angle at each time step. Brighter colors indicate higher Shapley values. The Shapley values are highest for disturbances that cause the agent to think that the pendulum is further from tipping over than it actually is, causing it to apply too small of a torque to move toward upright. The second plot shows the trajectory with the four disturbances with the highest Shapley values marked in red. If we remove these disturbances one at a time (blue trajectories in the third plot), we cause small changes in the outcome. However, if we remove all four disturbances at once (purple trajectory in the third plot), we cause a large change in the outcome.


• High Fidelity: The surrogate model should accurately represent the policy. If
the surrogate model does not adequately represent the policy, the explanations
it provides may be misleading.

• High Interpretability: The surrogate model should be easily interpretable by


humans. If the surrogate model is too complex, it may be difficult to understand
the reasoning behind the decisions.

In general, there is a tradeoff between fidelity and interpretability. A more complex


model may be higher fidelity but less interpretable, while a simpler model may
be more interpretable but lower fidelity.
Surrogate models can provide local explanations or global explanations of a policy. Local explanations provide insight into a single decision, while global explanations provide insight into the full policy. To create a local surrogate model, we create a dataset of observations and corresponding decisions near the observation of interest and fit a model to this dataset.13 For global surrogate models, we gather data across the entire observation space. When selecting a model class to fit the data, we must consider the tradeoff between fidelity and interpretability. This section discusses this tradeoff for two common model classes used as surrogate models.

13 It is common to weight these data points with higher weights for observations that are closer to the observation of interest. M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why Should I Trust You?' Explaining the Predictions of Any Classifier," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

12.4.1 Linear Models

One common choice for a surrogate model is a linear model. Linear models have the form

    f(\mathbf{x}) = \sum_{i=1}^{n} w_i x_i + b \tag{12.4}

where x_i is a feature of the observation, w_i is a weight for feature i, and b is the bias term. If the action space is discrete, we may apply the logistic or softmax function to the output of the linear model to obtain probabilities for each action. Linear surrogate models can be used to determine feature importance. The magnitudes of the weights of the linear model indicate the contribution of each feature to the agent's decision. Figure 12.10 demonstrates how to use a linear surrogate model to describe the behavior of a collision avoidance policy in two different regions of the observation space. This technique is particularly useful for high-dimensional observations, where it may be difficult to visualize the policy directly.
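The following is a minimal sketch of fitting such a local linear surrogate by weighted least squares. The scalar-output policy function, the observation of interest o₀, and the sampling scale σ are illustrative assumptions rather than part of the code in this chapter:

using LinearAlgebra

function local_linear_surrogate(policy, o₀; m=1000, σ=0.1)
    os = [o₀ .+ σ .* randn(length(o₀)) for _ in 1:m]        # samples near o₀
    A = reduce(vcat, [vcat(o, 1.0)' for o in os])           # design matrix with bias column
    y = [policy(o) for o in os]                             # policy outputs at the samples
    W = Diagonal([exp(-norm(o - o₀)^2 / σ^2) for o in os])  # proximity weights
    θ = (A' * W * A) \ (A' * W * y)                         # weighted least squares fit
    return θ[1:end-1], θ[end]                               # feature weights and bias
end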


Figure 12.10 (columns: Original Policy, Linear Approximation, Feature Weights; axes: h in m versus t_col in s). Linear surrogate model fit to samples in two different local regions (highlighted circles) of the observation space for a collision avoidance policy. The left column shows the original policy, where blue corresponds to the climb action, green corresponds to the descend action, and white corresponds to no advisory. The column in the middle shows the linear surrogate model fit to the samples in the highlighted circle indicated by the purple dots. The right column shows the feature weights of the linear surrogate model for each state variable. The linear surrogate model is accurate in the local region where it was fit, but may not be accurate in other regions of the observation space. Based on the feature weights, the linear surrogate model indicates that the agent's decision depends on both the relative altitude and time to collision when the relative altitude is around 50 m. In contrast, when the relative altitude is around 0 m, the agent's decision primarily depends on the relative altitude.


For complex policies, a model that is simply a linear function of observation variables may not provide sufficient fidelity. We may need to add more complex features of the observation to the model. For example, we could add polynomial features of the observation to the linear model to capture nonlinear relationships between the features. Alternatively, we can train a neural network to learn a set of nonlinear features that can be linearly combined, but these features are generally not interpretable. Figure 12.11 shows the tradeoff between fidelity and interpretability for a linear model with polynomial features. A common technique to simplify linear models with many features is to encourage sparsity in the weights using a technique called LASSO regression.14 The features with nonzero weights in a sparse linear model tend to be the most relevant to the decision.

14 R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 58, no. 1, pp. 267–288, 1996.
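As an illustration of how the L1 penalty drives weights to exactly zero, the following is a minimal sketch of LASSO regression via iterative soft-thresholding (ISTA); the feature matrix A, the targets y, and the penalty λ are illustrative names:

using LinearAlgebra

soft(w, t) = sign(w) * max(abs(w) - t, 0.0)  # soft-thresholding operator

function lasso(A, y; λ=0.1, iters=1000)
    w = zeros(size(A, 2))
    η = 1 / opnorm(A)^2                # step size from the Lipschitz constant
    for _ in 1:iters
        g = A' * (A * w - y)           # gradient of the least-squares term
        w = soft.(w .- η .* g, η * λ)  # proximal step for the L1 penalty
    end
    return w                           # many entries end up exactly zero
end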

Figure 12.11 (rows range from low fidelity/high interpretability to high fidelity/low interpretability; feature weights shown for h, t_col, t_col², h·t_col, h², t_col³, h·t_col², h²·t_col, and h³). Tradeoff between interpretability and fidelity in a linear surrogate model. Each row corresponds to a linear model with first, second, and third order features, respectively. The left column shows the decision boundary of the surrogate model, with the black point indicating the state for which the model is evaluated. The right column shows the feature weights of the surrogate model. The plot below shows the original policy and the sampled points. As the order of the polynomial features increases, the model becomes more accurate in the local region where it was fit at the cost of lower interpretability.


12.4.2 Decision Trees


Decision trees model policies as a series of simple decisions.15 Each node in the tree represents a decision based on a feature of the observation, and the leaf nodes represent the action taken by the agent. Decisions are typically represented as binary splits, where we follow the left branch in the tree if the feature value is less than a threshold and the right branch if the feature value is greater than or equal to the threshold. Example 12.5 shows a simple decision tree for a slice of the collision avoidance policy.

15 The DecisionTree.jl package can be used to train a decision tree model from data.

The maximum depth of the decision tree controls the tradeoff between fidelity and interpretability. Shallow decision trees tend to be more interpretable because they do not require many decisions to make a prediction. However, shallow trees are less expressive and may miss important features of the policy. Figure 12.12 shows the tradeoff between fidelity and interpretability for decision trees.
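The following sketch fits a decision tree surrogate with the DecisionTree.jl package mentioned in the margin; the discrete policy function and the sampling ranges over the (h, t_col) slice are illustrative assumptions:

using DecisionTree

os = [[rand(-400.0:400.0), rand(-40.0:0.0)] for _ in 1:100_000]
features = permutedims(reduce(hcat, os))  # one observation per row
labels = [string(policy(o)) for o in os]  # action chosen at each observation
tree = build_tree(labels, features)       # fit the surrogate
tree = prune_tree(tree, 0.9)              # merge leaves that are at least 90% pure
print_tree(tree, 2)                       # inspect the top two levels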

12.5 Counterfactual Explanations

A counterfactual is a hypothetical scenario that describes how an outcome would change if events had unfolded differently. Counterfactual explanations explain the behavior of a model by identifying the smallest change to the input that would result in a different outcome. We can frame the problem of generating a counterfactual explanation as a multiobjective optimization problem, in which our goal is to maximize the following four objectives:16

16 S. Dandl, C. Molnar, M. Binder, and B. Bischl, "Multi-Objective Counterfactual Explanations," in International Conference on Parallel Problem Solving from Nature, 2020.

1. Change in outcome: The counterfactual input should result in an outcome different from that obtained with the original input. If our goal is to change a trajectory τ from a failure to a success, we can use the temporal logic robustness metric as the objective:

    f_{\text{outcome}}(\tau') = \text{robustness}(\tau', \psi) \tag{12.5}

where τ′ is the counterfactual trajectory. By maximizing robustness, we move toward a trajectory that satisfies the safety property ψ.

2. Distance to original input: The counterfactual input should be close to the original input τ to ensure that the change is minimal, resulting in the following objective:

    f_{\text{close}}(\tau') = -\lVert \tau' - \tau \rVert_p \tag{12.6}


Example 12.5. Simple decision tree for the collision avoidance policy. (A plot of the policy represented by the decision tree, h in m versus t_col in s, accompanies this example.)

Suppose we want to train a decision tree to approximate the slice of the collision avoidance policy shown in example 12.1. The following decision tree was trained on a dataset of 100,000 randomly sampled states from the policy slice. The decision tree has a maximum depth of 2 and uses the state variables to make decisions. Nodes that split using h are shown in black, nodes that split using t_col are shown in gray, and the colors of the square leaf nodes are the actions taken by the agent.

    h < 0?
    ├── yes: h < −101?
    │        ├── yes: descend
    │        └── no: no advisory
    └── no: h < 98?
             ├── yes: no advisory
             └── no: climb

With a maximum depth of 2, the decision tree only makes decisions based on h. If h is positive, the tree selects whether to climb or issue no advisory based on the magnitude of the relative altitude. Similarly, if h is negative, the tree selects whether to descend or issue no advisory based on the magnitude of the relative altitude. The policy represented by the decision tree is shown in the caption. This decision tree provides a simple, interpretable model of the agent's policy. However, the fidelity of the decision tree is limited by its depth, and it misses some key features of the policy that depend on the time to collision.


Figure 12.12 (three decision-boundary plots of h in m versus t_col in s, ranging from low fidelity/high interpretability to high fidelity/low interpretability). Tradeoff between fidelity and interpretability when training a decision tree surrogate model on the slice of the policy shown in example 12.1. Each row corresponds to a decision tree with a maximum depth of 2, 4, and 6, respectively. The left column shows the decision boundary of the surrogate model. The colors correspond to the color scheme shown in example 12.5. As the maximum depth of the decision tree increases, the model becomes more accurate at the cost of lower interpretability.


where τ′ is the counterfactual input and ‖·‖_p is the L_p norm.

3. Sparsity of the change: The difference between the original input and the counterfactual input should be sparse. In other words, the counterfactual input should differ in only a few features. We can use the following objective:

    f_{\text{sparsity}}(\tau') = -\lVert \tau' - \tau \rVert_0 \tag{12.7}

where ‖·‖₀ returns the number of nonzero elements.17 This objective presents challenges for gradient-based optimization algorithms because its derivative is zero almost everywhere. To use gradient-based optimization, we can instead use the L1 norm, which encourages sparsity in the final solution.

17 This operation is sometimes referred to as the L0 norm; however, it is not a proper norm because it does not scale properly when multiplied by a scalar.

4. Plausibility: The new input should be a plausible input. We can check plausibility using the likelihood of the counterfactual trajectory as follows:

    f_{\text{plaus}}(\tau') = p(\tau') \tag{12.8}

The four counterfactual objectives are often at odds with one another. For example, only making small changes to the input is unlikely to change the outcome. We can use multiobjective optimization techniques to find counterfactual inputs that balance these objectives.18 Algorithm 12.5 creates a single objective function using a weighted sum of the objectives. We can apply a variety of optimization algorithms to find the counterfactual input that maximizes the objective function (see appendix C).19 To ensure compatibility with gradient-based optimization techniques, we omit the objective in equation (12.7) and instead set p = 1 in equation (12.6) to encourage sparsity. Figure 12.13 shows the generation of a counterfactual explanation for a failure of the inverted pendulum system.

18 An overview of multiobjective optimization is provided in chapter 12 of M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.

19 It is also common to use genetic algorithms that encourage diversity in the solutions to find a diverse set of counterfactual explanations. S. Dandl, C. Molnar, M. Binder, and B. Bischl, "Multi-Objective Counterfactual Explanations," in International Conference on Parallel Problem Solving from Nature, 2020.

Algorithm 12.5. Counterfactual objective function that combines equations (12.5), (12.6), and (12.8). The objective function takes in the counterfactual input x, the system sys, the specification ψ, the original input x₀, and a vector of weights ws. The algorithm computes each individual objective and returns their weighted sum. We take the logarithm of f_plaus for numerical stability.

function counterfactual_objective(x, sys, ψ, x₀; ws=ones(3))
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    foutcome = robustness([step.s for step in τ], ψ.formula)
    fclose = -norm(x - x₀, 1)
    fplaus = logpdf(NominalTrajectoryDistribution(sys, length(𝐱)), τ)
    return ws' * [foutcome, fclose, fplaus]
end
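A brief usage sketch with a gradient-free optimizer from the Optim.jl package; the weights and the use of Nelder-Mead are illustrative choices, and we negate the objective because Optim minimizes:

using Optim

f(x) = -counterfactual_objective(x, sys, ψ, x₀; ws=[1.0, 0.1, 0.01])
result = optimize(f, copy(x₀), NelderMead())  # start the search from the original input
x′ = Optim.minimizer(result)                  # candidate counterfactual input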
We are often interested in producing counterfactual explanations for inputs that we can control. For example, a counterfactual explanation for a loan approval system that involves a change in the income of the applicant is more useful than an explanation that requires a change in their age. While we have control over the actions of an agent, we often do not have control over the disturbances that affect the system.


Figure 12.13 (four plots of the pendulum angle over time in s, labeled Original, 6 changes, 7 changes, and 8 changes). Generation of a counterfactual trajectory for the inverted pendulum system by changing the disturbances on the measurement of θ. The original trajectory is shown in the plot on the top with the disturbance at each time step shown in black. The remaining plots show the counterfactual trajectories with different numbers of disturbances changed. The disturbances that differ from the original trajectory are shown in blue. We can create these trajectories by decreasing the relative weighting of the closeness objective in the counterfactual objective until we generate a trajectory that is no longer a failure. As noted in example 12.4, the disturbances at the beginning of the trajectory have the most significant impact on the outcome because the controller is saturated at the end of the trajectory.


Figure 12.14 (two plots of h in m versus t_col in s). Counterfactual explanations (blue) for a failure of the collision avoidance system (black). The arrows represent the direction of the commanded collision avoidance advisory at each time step; no arrow indicates no advisory. The black arrows represent the original trajectory, while the blue arrow represents the action change used to generate the counterfactual trajectories. The plot on the left shows the counterfactual explanation when holding all other actions and disturbances constant. The plot on the right shows the counterfactual explanations when rolling out the trajectory for all other time steps. In this scenario, we can conclude that issuing a descend advisory a few time steps earlier would have prevented the failure.
We can generate several different types of counterfactual explanations over the actions of the agent. The simplest type of counterfactual explanation involves changing an action at a particular time step or set of time steps while keeping all other actions constant. This technique is most similar to algorithm 12.5. A key assumption of this method is that the components of the input are independent of one another. However, changing the action at one time step affects the state at the next time step, which in turn affects the action at the next time step. This cascading effect breaks the independence assumption and can lead to counterfactual explanations that are not plausible.

One way to account for the cascading effect is to select actions that maximize the expected change in the outcome given all possible future actions and disturbances. If we are searching for counterfactuals that only change the action at one time step, we can produce a set of counterfactual explanations by performing rollouts of the policy for the remaining time steps. We can then select the action that maximizes the expected change in the outcome, as sketched below. Figure 12.14 shows this technique for the aircraft collision avoidance example. Understanding the effects of changing multiple actions requires more sophisticated techniques.20

20 S. Dandl, C. Molnar, M. Binder, and B. Bischl, "Multi-Objective Counterfactual Explanations," in International Conference on Parallel Problem Solving from Nature, 2020.


12.6 Failure Mode Characterization

Another way to explain the behavior of a system is to characterize its failure modes. We can use clustering algorithms to create groupings of failure trajectories that are similar to one another. Identifying the similarities and differences between failures helps us understand their underlying causes. One common clustering algorithm is k-means21 (algorithm 12.6), which groups data points into k clusters based on their similarity to one another.22

21 This algorithm is also referred to as Lloyd's algorithm, named after Stuart P. Lloyd (1923–2007). S. Lloyd, "Least Squares Quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.

22 A detailed overview of clustering algorithms is provided in D. Xu and Y. Tian, "A Comprehensive Survey of Clustering Algorithms," Annals of Data Science, vol. 2, pp. 165–193, 2015.

To apply k-means, we must first extract a set of real-valued features from each failure trajectory to use for clustering. Let x represent the set of features from trajectory τ and φ be a feature extraction function such that x = φ(τ). To represent the clusters C, k-means keeps track of k cluster centroids μ_{1:k} in feature space and assigns each trajectory to the cluster with the closest centroid to its features. We begin by initializing the centroids to the features of k random trajectories. At each iteration, k-means performs the following steps:
1. Assign each trajectory to the cluster with the closest centroid to its feature vector. In other words, τᵢ is assigned to cluster C_j when

    j = \arg\min_{j \in 1:k} d(\mathbf{x}_i, \boldsymbol{\mu}_j) \tag{12.9}

where d(·, ·) is a distance metric such as the L2 norm.

2. Update the centroids to the mean of the feature vectors of the trajectories in each cluster such that

    \boldsymbol{\mu}_j = \frac{1}{|\mathcal{C}_j|} \sum_{\tau \in \mathcal{C}_j} \phi(\tau) \tag{12.10}

where |C_j| is the number of trajectories in cluster C_j.

The algorithm repeats until the centroids converge or a maximum number of iterations is reached. The k-means algorithm may converge to a local minimum depending on the initialization of the centroids, so it is common to run the algorithm multiple times with different initializations. Figure 12.15 shows the progression of the k-means algorithm on failure trajectories of the inverted pendulum system using the average angle and angular velocity of each trajectory as the features.


Algorithm 12.6. The k-means algorithm for clustering failure trajectories. The algorithm takes in a set of trajectories τs, a feature extraction function ϕ, a distance metric d, the number of clusters k, and the maximum number of iterations max_iter. The algorithm first extracts the features from each trajectory and initializes the centroids to the features of random trajectories. At each iteration, it assigns each trajectory to the cluster with the closest centroid and updates the centroids to the mean of the feature vectors of the trajectories in each cluster.

struct Kmeans
    τs # trajectories to cluster
    ϕ # feature extraction function (x = ϕ(τ))
    d # distance metric function (d(x[i], μⱼ))
    k # number of clusters
    max_iter # maximum number of iterations
end

function describe(alg::Kmeans, sys, ψ)
    x = [alg.ϕ(τ) for τ in alg.τs]
    μ = x[randperm(length(x))[1:alg.k]]
    𝒞 = Dict(map(j->Pair(j, []), 1:alg.k))
    for _ in 1:alg.max_iter
        map(j->𝒞[j] = [], 1:alg.k)
        for i in eachindex(x)
            push!(𝒞[argmin([alg.d(x[i], μⱼ) for μⱼ in μ])], i)
        end
        for j in 1:alg.k
            if !isempty(𝒞[j])
                μ[j] = mean(x[i] for i in 𝒞[j])
            end
        end
    end
    return 𝒞, μ
end
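A brief usage sketch on inverted pendulum failures, assuming the state vector stores the angle and angular velocity so that the features match those used in figure 12.15; τs is a set of failure trajectories:

using LinearAlgebra, Statistics

ϕ(τ) = [mean(step.s[1] for step in τ), mean(step.s[2] for step in τ)]
d(x, μ) = norm(x - μ)           # L2 distance in feature space
alg = Kmeans(τs, ϕ, d, 2, 100)  # k = 2 clusters, at most 100 iterations
𝒞, μ = describe(alg, sys, ψ)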


Figure 12.15 (top row: cluster assignments in the feature space of average θ versus average ω for the panels Initialization, Iteration 1, Iteration 2, and Converged; bottom row: the corresponding trajectories, θ in rad versus time in s). Progression of the k-means algorithm on failure trajectories of the inverted pendulum system with k = 2 and the average angle and angular velocity of each trajectory as the features. The colors represent the different clusters at each iteration, and the black crosses represent the cluster centroids. The algorithm converges to a set of clusters that represent two distinct failure modes. One failure mode corresponds to trajectories in which the pendulum falls to the left, and the other cluster corresponds to trajectories in which the pendulum falls to the right.

The clustering results help us understand the failure modes of the system. One way to interpret the clusters is to create a prototypical example for each cluster. The prototypical example for a given cluster is the trajectory that is closest to its centroid in feature space. By examining the prototypical examples, we can understand the characteristics of each failure mode. Figure 12.16 shows the prototypical examples for the final clusters in figure 12.15. At runtime, we can assign new failure trajectories to the cluster with the closest centroid to their features and use the prototypical examples to explain the failure mode of the trajectory.

Figure 12.16 (plot of θ in rad versus time in s). Prototypical examples of failure modes in the inverted pendulum system using the clusters in figure 12.15. The prototypes reveal that one failure mode involves the pendulum falling to the left, while the other involves the pendulum falling to the right.

Algorithm 12.6 requires us to select the number of clusters k, the distance function d, and the feature extraction function φ. The clustering results are highly dependent on these choices. However, selecting the number of clusters and the features is often a subjective process that requires domain knowledge. To select the number of clusters, we can try different values for k and select the one that results in the most interpretable clusters or that minimizes a clustering objective such as the sum of the squared distances between each point and its cluster centroid.


We can also use domain knowledge to select features that are likely to capture
the underlying causes of the failures. A simple way to select features is to create a
feature vector by concatenating all of the states in the trajectory. We could create
similar feature vectors for the actions, observations, and disturbances. However,
these feature vectors will be high-dimensional and may not result in interpretable
clusters (figure 12.17).

Figure 12.17 (three plots of θ in rad versus time in s, labeled State Features, Action Features, and Disturbance Features). Clustering failure trajectories of the inverted pendulum system using features consisting of the states, actions, and disturbances of each trajectory, respectively. The colors represent the different clusters. The clusters based on the state and action features show interpretable failure modes, while the clusters based on the disturbance features are less interpretable.

To improve interpretability of the clusters, we can cluster the trajectories based on temporal logic features. Specifically, we use the parameters of a parametric signal temporal logic (PSTL) formula as the features for clustering.23 PSTL is an extension of signal temporal logic (section 3.5.2), in which the time constants in the temporal operators and the signal values in the atomic propositions may be replaced by parameters. PSTL expressions represent template formulas that can be instantiated to STL formulas with specific values for the parameters.

23 M. Vazquez-Chanlatte, J. V. Deshmukh, X. Jin, and S. A. Seshia, "Logical Clustering and Learning for Time-Series Data," in International Conference on Computer Aided Verification, 2017.

To perform clustering with PSTL features, we first select a template formula. We then set the features of each trajectory to the values of the parameters for which the formula is marginally satisfied. An STL formula is marginally satisfied by a trajectory if the robustness of the trajectory with respect to the formula is zero. By plugging these parameters into the template formula, we obtain a temporal logic formula that describes the behavior of the trajectory.

We can use optimization methods to find the values of the parameters that marginally satisfy the formula for each trajectory by finding the φ that minimizes ‖ρ(τ, ψ_φ)‖, where ρ is the robustness function and ψ_φ is the instantiated STL formula with parameters φ. Example 12.6 applies this idea to the inverted pendulum system. We can then perform clustering on the extracted PSTL features to identify failure modes. Figure 12.18 shows the clusters of failure trajectories of the inverted pendulum system using the PSTL template in example 12.6.
Figure 12.18 (plot of θ in rad versus time in s). Clusters of failure trajectories of the inverted pendulum system using the PSTL template in example 12.6 and k = 2. The algorithm results in two clusters. The pendulum falls over earlier in the blue trajectories compared to the purple trajectories.

Clustering using PSTL features requires us to select a template formula. The template formula should capture the key aspects of the system that are relevant to the failure modes. For systems with complex failure modes, it may be difficult to hand-design a template formula that captures all the failure modes. In these cases, we can use more sophisticated techniques that build decision trees using a grammar based on temporal logic.24

24 R. Lee, M. J. Kochenderfer, O. J. Mengshoel, and J. Silbermann, "Interpretable Categorization of Heterogeneous Time Series Data," in SIAM International Conference on Data Mining, 2018.

12.7 Summary

• Interpretable descriptions of system behavior are essential for understanding and calibrating trust.

• Policy visualization allows us to interpret the policy that a particular agent uses to make decisions.

• Feature importance methods such as saliency maps and Shapley values allow us to understand the impact of different features on the behavior of a system.

• Surrogate models allow us to explain the policy of a complex system using a simpler model and must balance between fidelity and interpretability.

• Counterfactual explanations provide insights into the decision-making process of a system by showing how changes in the input affect the output.

• We can characterize the failure modes of a system by clustering them using interpretable features.


Example 12.6. Example of a PSTL template formula for the inverted pendulum system. The plots show the robustness of the formula for different values of φ. Our goal is to find the value of φ that causes a given trajectory to marginally satisfy the formula.

The following STL formula specifies that the angle of the pendulum should not exceed π/4 for the first 200 time steps:

    \psi = \square_{[0,200]} \left(\theta < \frac{\pi}{4}\right)

If we replace the time bound with a parameter φ, we obtain the following PSTL template formula:

    \psi_\phi = \square_{[0,\phi]} \left(\theta < \frac{\pi}{4}\right)

The plots below show the robustness of the formula for different values of φ. The plot on the left shows a value for φ such that the trajectory satisfies the formula, the plot in the middle shows a value for φ that marginally satisfies the formula, and the plot on the right shows a value for φ such that the trajectory does not satisfy the formula.

[Three plots of θ in rad versus time in s: Satisfied (φ = 2.45), Marginally Satisfied (φ = 3.85), Not Satisfied (φ = 4.05).]

We can find the value of φ that marginally satisfies ψ_φ by searching for the value that causes the robustness to be as close as possible to zero. For this simple formula, we can solve the optimization problem using a grid search over the values of φ. The value of φ that marginally satisfies the formula will be the time just before the magnitude of the angle of the pendulum exceeds π/4.
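A minimal sketch of this grid search, assuming a hypothetical helper instantiate(template, ϕ) that fills the parameter into the template and returns the corresponding STL formula:

function marginal_parameter(template, τ; ϕs=0:0.05:5)
    ρ(ϕ) = robustness([step.s for step in τ], instantiate(template, ϕ))
    return argmin(ϕ -> abs(ρ(ϕ)), ϕs)  # parameter with robustness closest to zero
end

ϕ_best = marginal_parameter(ψ_template, τ)  # one PSTL feature for τ, given a hypothetical template ψ_template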

A Problems

A.1 Coming Soon


B Mathematical Concepts

B.1 Coming Soon


C Optimization

C.1 Coming Soon


D Neural Networks

D.1 Coming Soon


E Julia

E.1 Coming Soon


References

1. H. Abdi and L. J. Williams, “Principal Component Analysis,” Wiley Interdisciplinary


Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010 (cit. on p. 164).
2. A. Ahmadi-Javid, “Entropic Value-At-Risk: A New Coherent Risk Measure,” Journal
of Optimization Theory and Applications, vol. 155, no. 3, pp. 1105–1123, 2011 (cit. on
p. 24).
3. M. Akintunde, A. Lomuscio, L. Maganti, and E. Pirovano, “Reachability Analysis
for Neural Agent-Environment Systems,” in International Conference on Principles of
Knowledge Representation and Reasoning, 2018 (cit. on p. 197).
4. M. Althoff, “Reachability Analysis of Nonlinear Systems Using Conservative Poly-
nomialization and Non-Convex Sets,” in International Conference on Hybrid Systems:
Computation and Control, 2013 (cit. on p. 180).
5. M. Althoff and G. Frehse, “Combining Zonotopes and Support Functions for Effi-
cient Reachability Analysis of Linear Systems,” in IEEE Conference on Decision and
Control (CDC), 2016 (cit. on p. 161).
6. M. Althoff, G. Frehse, and A. Girard, “Set Propagation Techniques for Reachability
Analysis,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 369–
395, 2021 (cit. on p. 154).
7. M. Althoff, G. Frehse, and A. Girard, “Set Propagation Techniques for Reachability
Analysis,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 369–
395, 2021 (cit. on p. 171).
8. M. Althoff, O. Stursberg, and M. Buss, “Reachability Analysis of Nonlinear Systems
with Uncertain Parameters Using Conservative Linearization,” in IEEE Conference
on Decision and Control (CDC), 2008 (cit. on pp. 180, 181).
9. T. L. Arel, Safety Management System Manual, Air Traffic Organization, Federal Avia-
tion Administration, 2022 (cit. on p. 13).
10. M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A Tutorial on Particle
Filters for Online Nonlinear/non-Gaussian Bayesian Tracking,” IEEE Transactions
on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002 (cit. on p. 125).

11. T. W. Athan and P. Y. Papalambros, “A Note on Weighted Criteria Methods for


Compromise Solutions in Multi-Objective Optimization,” Engineering Optimization,
vol. 27, no. 2, pp. 155–176, 1996 (cit. on p. 28).
12. D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller,
“How to Explain Individual Classification Decisions,” Journal of Machine Learning
Research, vol. 11, pp. 1803–1831, 2010 (cit. on p. 234).
13. C. Baier and J.-P. Katoen, “Principles of Model Checking,” in MIT Press, 2008, ch. 1
(cit. on p. 2).
14. C. Baier and J.-P. Katoen, “Principles of Model Checking,” in MIT Press, 2008, ch. 6
(cit. on p. 35).
15. C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008 (cit. on p. 43).
16. G. Barthe, J.-P. Katoen, and A. Silva, Foundations of Probabilistic Programming. Cam-
bridge University Press, 2020 (cit. on p. 102).
17. A. F. Bielajew, “History of Monte Carlo,” in Monte Carlo Techniques in Radiation
Therapy, CRC Press, 2021, pp. 3–15 (cit. on p. 4).
18. A. Biere, Handbook of Satisfiability. IOS Press, 2009, vol. 185 (cit. on pp. 206, 208).
19. A. T. Borchers, F. Hagie, C. L. Keen, and M. E. Gershwin, “The History and Contem-
porary Challenges of the US Food and Drug Administration,” Clinical Therapeutics,
vol. 29, no. 1, pp. 1–16, 2007 (cit. on p. 5).
20. S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press,
2004 (cit. on p. 166).
21. C. B. Browne, E. Powley, D. Whitehouse, et al., “A Survey of Monte Carlo Tree
Search Methods,” IEEE Transactions on Computational Intelligence and AI in Games,
vol. 4, no. 1, pp. 1–43, 2012 (cit. on p. 80).
22. O. Cappé, A. Guillin, J.-M. Marin, and C. P. Robert, “Population Monte Carlo,”
Journal of Computational and Graphical Statistics, vol. 13, no. 4, pp. 907–929, 2004 (cit.
on p. 123).
23. G. Casella and E. I. George, “Explaining the Gibbs Sampler,” The American Statisti-
cian, vol. 46, no. 3, pp. 167–174, 1992 (cit. on p. 102).
24. F. Cérou and A. Guyader, “Adaptive Multilevel Splitting for Rare Event Analysis,”
Stochastic Analysis and Applications, vol. 25, no. 2, pp. 417–443, 2007 (cit. on p. 139).
25. J. K. Choi and Y. G. Ji, “Investigating the Importance of Trust on Adopting an
Autonomous Vehicle,” International Journal of Human-Computer Interaction, vol. 31,
no. 10, pp. 692–702, 2015 (cit. on p. 7).
26. B. Christian, The Alignment Problem: Machine Learning and Human Values. W. W.
Norton & Company, 2020 (cit. on p. 2).


27. J. Colin, T. Fel, R. Cadène, and T. Serre, “What I Cannot Predict, I Do Not Understand:
A Human-Centered Evaluation Framework for Explainability Methods,” Advances
in Neural Information Processing Systems (NeurIPS), pp. 2832–2845, 2022 (cit. on
p. 226).
28. V. Conitzer, “Eliciting Single-Peaked Preferences Using Comparison Queries,”
Journal of Artificial Intelligence Research, vol. 35, pp. 161–191, 2009 (cit. on p. 28).
29. A. Couëtoux, J.-B. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard, “Continuous
Upper Confidence Trees,” in Learning and Intelligent Optimization (LION), 2011 (cit.
on p. 84).
30. S. Dandl, C. Molnar, M. Binder, and B. Bischl, “Multi-Objective Counterfactual
Explanations,” in International Conference on Parallel Problem Solving from Nature,
2020 (cit. on pp. 243, 246, 248).
31. T. Dang and T. Nahhal, “Coverage-Guided Test Generation for Continuous and
Hybrid Systems,” Formal Methods in System Design, vol. 34, pp. 183–213, 2009 (cit.
on p. 76).
32. P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A Tutorial on the
Cross-Entropy Method,” Annals of Operations Research, vol. 134, pp. 19–67, 2005 (cit.
on p. 119).
33. P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo Samplers,” Journal of
the Royal Statistical Society Series B: Statistical Methodology, vol. 68, no. 3, pp. 411–436,
2006 (cit. on p. 127).
34. H. Delecki, A. Corso, and M. J. Kochenderfer, “Model-Based Validation as Proba-
bilistic Inference,” in Conference on Learning for Dynamics and Control (L4DC), 2023
(cit. on p. 97).
35. W. M. Dickie, “A Comparison of the Scientific Method and Achievement of Aristotle
and Bacon,” The Philosophical Review, vol. 31, no. 5, pp. 471–494, 1922 (cit. on p. 3).
36. M. Dowson, “The Ariane 5 Software Failure,” Software Engineering Notes, vol. 22,
no. 2, p. 84, 1997 (cit. on p. 7).
37. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid Monte Carlo,”
Physics Letters B, vol. 195, no. 2, pp. 216–222, 1987 (cit. on p. 102).
38. A. Duret-Lutz, “Manipulating LTL Formulas Using Spot 1.0,” in Automated Technol-
ogy for Verification and Analysis, 2013 (cit. on p. 43).
39. EASA AI Task Force, “Concepts of Design Assurance for Neural Networks,” EASA,
2020 (cit. on p. 5).
40. V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo, “Generalized Multiple Im-
portance Sampling,” Statistical Science, vol. 34, no. 1, pp. 129–155, 2019 (cit. on
p. 116).


41. A. Engel, Verification, Validation, and Testing of Engineered Systems. John Wiley & Sons,
2010, vol. 73 (cit. on p. 1).
42. J. M. Esposito, J. Kim, and V. Kumar, “Adaptive RRTs for Validating Hybrid Robotic
Control Systems,” in Algorithmic Foundations of Robotics, Springer, 2005, pp. 107–121
(cit. on p. 74).
43. M. Everett, G. Habibi, C. Sun, and J. P. How, “Reachability Analysis of Neural
Feedback Loops,” IEEE Access, vol. 9, pp. 163 938–163 953, 2021 (cit. on p. 194).
44. M. Everett, G. Habibi, C. Sun, and J. P. How, “Reachability Analysis of Neural
Feedback Loops,” IEEE Access, vol. 9, pp. 163 938–163 953, 2021 (cit. on p. 197).
45. M. Forets and C. Schilling, “LazySets.jl: Scalable Symbolic-Numeric Set Compu-
tations,” Proceedings of the JuliaCon Conferences, vol. 1, no. 1, pp. 1–11, 2021 (cit. on
p. 147).
46. K. Forsberg and H. Mooz, “The Relationship of System Engineering to the Project
Cycle,” Center for Systems Management, vol. 5333, 1991 (cit. on p. 5).
47. R. Geirhos, J.-H. Jacobsen, C. Michaelis, et al., “Shortcut Learning in Deep Neural
Networks,” Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020 (cit. on
p. 229).
48. J. W. Gelder, “Air Law: The Federal Aviation Act of 1958,” Michigan Law Review,
vol. 57, no. 8, pp. 1214–1227, 1959 (cit. on p. 5).
49. B. Goodman and S. Flaxman, “European Union Regulations on Algorithmic Decision-
Making and a ‘Right to Explanation’,” AI Magazine, vol. 38, no. 3, pp. 50–57, 2017
(cit. on p. 226).
50. M. A. Gosavi, B. B. Rhoades, and J. M. Conrad, “Application of Functional Safety in
Autonomous Vehicles Using ISO 26262 Standard: A Survey,” in SoutheastCon, 2018
(cit. on p. 5).
51. U. Grenander and M. I. Miller, “Representations of Knowledge in Complex Sys-
tems,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 56, no. 4,
pp. 549–581, 1994 (cit. on p. 99).
52. A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of
Algorithmic Differentiation, 2nd ed. SIAM, 2008 (cit. on p. 103).
53. H. Hansson and B. Jonsson, “A Logic for Reasoning about Time and Reliability,”
Formal Aspects of Computing, vol. 6, pp. 512–535, 1994 (cit. on p. 23).
54. P. E. Hart, N. J. Nilsson, and B. Raphael, “A Formal Basis for the Heuristic Determi-
nation of Minimum Cost Paths,” IEEE Transactions on Systems Science and Cybernetics,
vol. 4, no. 2, pp. 100–107, 1968 (cit. on p. 80).
55. W. K. Hastings, “Monte Carlo Sampling Methods Using Markov Chains and Their
Applications,” Biometrika, vol. 57, no. 1, pp. 97–97, 1970 (cit. on p. 94).


56. C. Hensel, S. Junges, J.-P. Katoen, T. Quatmann, and M. Volk, “The Probabilistic
Model Checker Storm,” International Journal on Software Tools for Technology Transfer,
pp. 1–22, 2022 (cit. on p. 210).
57. J. Herkert, J. Borenstein, and K. Miller, “The Boeing 737 MAX: Lessons for Engi-
neering Ethics,” Science and Engineering Ethics, vol. 26, pp. 2957–2974, 2020 (cit. on
p. 7).
58. M. D. Hoffman, A. Gelman, et al., “The No-U-Turn Sampler: Adaptively Setting
Path Lengths in Hamiltonian Monte Carlo.,” Journal of Machine Learning Research
(JMLR), vol. 15, no. 1, pp. 1593–1623, 2014 (cit. on p. 102).
59. A. L. Hunkenschroer and A. Kriebitz, “Is AI Recruiting (Un)ethical? A Human
Rights Perspective on the Use of AI for Hiring,” AI and Ethics, vol. 3, no. 1, pp. 199–
213, 2023 (cit. on p. 6).
60. M. Huth and M. Ryan, Logic in Computer Science: Modelling and Reasoning about
Systems. Cambridge University Press, 2004 (cit. on pp. 30, 31).
61. K. Ishikawa and J. H. Loftus, “Introduction to Quality Control,” in Springer, 1990,
vol. 98, ch. 1 (cit. on p. 3).
62. V. S. Iyengar, J. Lee, and M. Campbell, “Q-EVAL: Evaluating Multiple Attribute
Items Using Queries,” in ACM Conference on Electronic Commerce, 2001 (cit. on p. 29).
63. L. Jaulin, M. Kieffer, O. Didrit, and É. Walter, Interval Analysis. Springer, 2001 (cit.
on p. 171).
64. P. Jorion, “Risk Management Lessons from Long-Term Capital Management,” Euro-
pean Financial Management, vol. 6, no. 3, pp. 277–300, 2000 (cit. on p. 7).
65. H. Kahn and T. E. Harris, “Estimation of Particle Transmission by Random Sam-
pling,” National Bureau of Standards Applied Mathematics Series, vol. 12, pp. 27–30,
1951 (cit. on p. 138).
66. G. K. Kamenev, “An Algorithm for Approximating Polyhedra,” Computational Math-
ematics and Mathematical Physics, vol. 4, no. 36, pp. 533–544, 1996 (cit. on p. 162).
67. S. Karaman and E. Frazzoli, “Incremental Sampling-Based Algorithms for Optimal
Motion Planning,” Robotics Science and Systems VI, vol. 104, no. 2, pp. 267–274, 2010
(cit. on p. 79).
68. H. Karloff, Linear Programming. Springer, 2008 (cit. on p. 164).
69. S. M. Katz, K. D. Julian, C. A. Strong, and M. J. Kochenderfer, “Generating Probabilis-
tic Safety Guarantees for Neural Network Controllers,” Machine Learning, vol. 112,
pp. 2903–2931, 2023 (cit. on p. 218).
70. J. Kleinberg, S. Mullainathan, and M. Raghavan, “Inherent Trade-Offs in the Fair
Determination of Risk Scores,” in Innovations in Theoretical Computer Science (ITCS)
Conference, 2017 (cit. on p. 7).


71. M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019
(cit. on pp. 60, 246).
72. M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making.
MIT Press, 2022 (cit. on pp. 2, 10).
73. A. Kossiakoff, S. M. Biemer, S. J. Seymour, and D. A. Flanigan, Systems Engineering
Principles and Practice. John Wiley & Sons, 2020 (cit. on p. 2).
74. J. Kuchar and A. C. Drumm, “The Traffic Alert and Collision Avoidance System,”
Lincoln Laboratory Journal, vol. 16, no. 2, p. 277, 2007 (cit. on p. 6).
75. M. Kwiatkowska, G. Norman, and D. Parker, “PRISM 4.0: Verification of Proba-
bilistic Real-Time Systems,” in International Conference on Computer Aided Verification,
2011 (cit. on p. 210).
76. S. LaValle, “Planning Algorithms,” Cambridge University Press, vol. 2, pp. 3671–3678,
2006 (cit. on p. 69).
77. R. Lee, M. J. Kochenderfer, O. J. Mengshoel, and J. Silbermann, “Interpretable Cate-
gorization of Heterogeneous Time Series Data,” in SIAM International Conference on
Data Mining, 2018 (cit. on p. 253).
78. R. Lee, O. J. Mengshoel, A. Saksena, et al., “Adaptive Stress Testing: Finding Likely
Failure Events with Reinforcement Learning,” Journal of Artificial Intelligence Research,
vol. 69, pp. 1165–1201, 2020 (cit. on p. 82).
79. K. Leung, N. Aréchiga, and M. Pavone, “Backpropagation Through Signal Temporal
Logic Specifications: Infusing Logical Structure into Gradient-Based Methods,” The
International Journal of Robotics Research, vol. 42, no. 6, pp. 356–370, 2023 (cit. on
p. 41).
80. N. G. Leveson and C. S. Turner, “An Investigation of the Therac-25 Accidents,”
Computer, vol. 26, no. 7, pp. 18–41, 1993 (cit. on p. 6).
81. C. Liu, T. Arnon, C. Lazarus, C. Strong, C. Barrett, and M. J. Kochenderfer, “Algo-
rithms for Verifying Deep Neural Networks,” Foundations and Trends in Optimization,
vol. 4, no. 3–4, pp. 244–404, 2021 (cit. on p. 194).
82. F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, “Marginal Likelihood
Computation for Model Selection and Hypothesis Testing: an Extensive Review,”
SIAM Review, vol. 65, no. 1, pp. 3–58, 2023 (cit. on pp. 126, 129).
83. S. Lloyd, “Least Squares Quantization in PCM,” IEEE Transactions on Information
Theory, vol. 28, no. 2, pp. 129–137, 1982 (cit. on p. 249).
84. K. Makino and M. Berz, “Taylor Models and Other Validated Functional Inclusion
Methods,” International Journal of Pure and Applied Mathematics, vol. 4, no. 4, pp. 379–
456, 2003 (cit. on p. 180).
85. O. Maler, “Computing Reachable Sets: An Introduction,” French National Center of
Scientific Research, pp. 1–8, 2008 (cit. on p. 154).


86. O. Maler and D. Nickovic, “Monitoring Temporal Properties of Continuous Signals,”


in International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems,
2004 (cit. on p. 37).
87. R. L. McCarthy, “Autonomous Vehicle Accident Data Analysis: California OL 316
Reports: 2015–2020,” ASCE-ASME Journal of Risk and Uncertainty in Engineering
Systems, Part B: Mechanical Engineering, vol. 8, no. 3, p. 034 502, 2022 (cit. on p. 6).
88. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, “Equa-
tion of State Calculations by Fast Computing Machines,” Journal of Chemical Physics,
vol. 21, no. 6, pp. 1087–1092, 1953 (cit. on p. 94).
89. B. P. Miller, L. Fredriksen, and B. So, “An Empirical Study of the Reliability of UNIX
Utilities,” Communications of the ACM, vol. 33, no. 12, pp. 32–44, 1990 (cit. on p. 53).
90. C. N. Murphy and J. Yates, The International Organization for Standardization (ISO):
Global Governance Through Voluntary Consensus. Routledge, 2009 (cit. on p. 5).
91. R. Neidinger, “Directions for Computing Truncated Multivariate Taylor Series,”
Mathematics of Computation, vol. 74, no. 249, pp. 321–340, 2005 (cit. on p. 177).
92. J. Nocedal, “Updating Quasi-Newton Matrices with Limited Storage,” Mathematics
of Computation, vol. 35, no. 151, pp. 773–782, 1980 (cit. on pp. 62, 67).
93. J. R. Norris, Markov Chains. Cambridge University Press, 1998 (cit. on p. 210).
94. R. Page and R. Gamboa, Essential Logic for Computer Science. MIT Press, 2019 (cit.
on p. 31).
95. J. Pearl, “Direct and Indirect Effects,” in Conference on Uncertainty in Artificial Intelli-
gence (UAI), 2001 (cit. on p. 229).
96. D. Phillips-Donaldson, “100 Years of Juran,” Quality Progress, vol. 37, no. 5, pp. 25–
31, 2004 (cit. on p. 4).
97. A. Pnueli, “The Temporal Logic of Programs,” in Symposium on Foundations of
Computer Science (SFCS), 1977 (cit. on p. 35).
98. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes
3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007 (cit.
on p. 65).
99. J. Reason, “Human Error: Models and Management,” British Medical Journal, vol. 320,
no. 7237, pp. 768–770, 2000 (cit. on p. 14).
100. R. G. Regis, “On the Properties of Positive Spanning Sets and Positive Bases,”
Optimization and Engineering, vol. 17, no. 1, pp. 229–262, 2016 (cit. on p. 161).
101. M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why Should I Trust You?’ Explaining the
Predictions of Any Classifier,” in ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2016 (cit. on p. 240).


102. L. Rierson, Developing Safety-Critical Software: a Practical Guide for Aviation Software
and DO-178C Compliance. CRC Press, 2017 (cit. on p. 5).
103. C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 1999, vol. 2 (cit.
on pp. 90, 94).
104. R. T. Rockafellar and S. Uryasev, “Optimization of Conditional Value-at-Risk,”
Journal of Risk, vol. 2, pp. 21–42, 2000 (cit. on p. 24).
105. W. W. Royce, “Managing the Development of Large Software Systems: Concepts
and Techniques,” IEEE WESCON, 1970 (cit. on p. 5).
106. S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed. Pearson,
2021 (cit. on p. 205).
107. R. Seidel, “Convex Hull Computations,” in Handbook of Discrete and Computational
Geometry, Chapman and Hall, 2017, pp. 687–703 (cit. on p. 156).
108. L. S. Shapley, “Notes on the N-Person Game—II: The Value of an N-Person Game,”
1951 (cit. on p. 236).
109. C. Sidrane, A. Maleki, A. Irfan, and M. J. Kochenderfer, “OVERT: An Algorithm for Safety Verification of Neural Network Control Policies for Nonlinear Systems,” Journal of Machine Learning Research, vol. 23, no. 117, pp. 1–45, 2022 (cit. on pp. 189, 190).
110. J. Siegel and G. Pappas, “Morals, Ethics, and the Technology Capabilities and
Limitations of Automated and Self-Driving Vehicles,” AI & Society, vol. 38, no. 1,
pp. 213–226, 2023 (cit. on p. 7).
111. S. Sigl and M. Althoff, “M-Representation of Polytopes,” ArXiv:2303.05173, 2023
(cit. on p. 156).
112. K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” in International Conference on Learning Representations (ICLR), 2014 (cit. on p. 234).
113. A. Sinha, M. O’Kelly, R. Tedrake, and J. C. Duchi, “Neural Bridge Sampling for
Evaluating Safety-Critical Autonomous Systems,” Advances in Neural Information
Processing Systems (NeurIPS), vol. 33, pp. 6402–6416, 2020 (cit. on pp. 126, 136).
114. D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, “SmoothGrad: Removing Noise by Adding Noise,” in International Conference on Machine Learning (ICML), 2017 (cit. on p. 234).
115. E. Soroka, M. J. Kochenderfer, and S. Lall, “Satisfiability.jl: Satisfiability Modulo
Theories in Julia,” Journal of Open Source Software, vol. 9, no. 100, p. 6757, 2024 (cit.
on p. 206).
116. C. A. Strong, H. Wu, A. Zeljic, et al., “Global Optimization of Objective Functions
Represented by ReLU Networks,” Machine Learning, vol. 112, pp. 3685–3712, 2023
(cit. on p. 197).

117. E. Štrumbelj and I. Kononenko, “Explaining Prediction Models and Individual Predictions with Feature Contributions,” Knowledge and Information Systems, vol. 41, pp. 647–665, 2014 (cit. on p. 237).
118. O. Stursberg and B. H. Krogh, “Efficient Representation and Computation of Reachable Sets for Hybrid Systems,” in Hybrid Systems: Computation and Control, 2003 (cit. on p. 164).
119. M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic Attribution for Deep Networks,”
in International Conference on Machine Learning (ICML), 2017 (cit. on p. 234).
120. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Second Edition.
MIT Press, 2018 (cit. on p. 84).
121. E. Thiémard, “An Algorithm to Compute Bounds for the Star Discrepancy,” Journal
of Complexity, vol. 17, no. 4, pp. 850–880, 2001 (cit. on p. 75).
122. S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press, 2006 (cit. on
p. 129).
123. R. Tibshirani, “Regression Shrinkage and Selection via the Lasso,” Journal of the
Royal Statistical Society Series B: Statistical Methodology, vol. 58, no. 1, pp. 267–288,
1996 (cit. on p. 242).
124. V. Tjeng, K. Y. Xiao, and R. Tedrake, “Evaluating Robustness of Neural Networks with Mixed Integer Programming,” in International Conference on Learning Representations (ICLR), 2018 (cit. on pp. 189, 197).
125. H.-D. Tran, D. Manzanas Lopez, P. Musau, et al., “Star-Based Reachability Analysis
of Deep Neural Networks,” in International Symposium on Formal Methods, 2019 (cit.
on p. 180).
126. W. M. Tsutsui, “W. Edwards Deming and the Origins of Quality Control in Japan,”
Journal of Japanese Studies, vol. 22, no. 2, pp. 295–325, 1996 (cit. on p. 4).
127. M. Vazquez-Chanlatte, J. V. Deshmukh, X. Jin, and S. A. Seshia, “Logical Clustering
and Learning for Time-Series Data,” in International Conference on Computer Aided
Verification, 2017 (cit. on p. 252).
128. L. A. Wolsey, Integer Programming. Wiley, 2020 (cit. on p. 189).
129. W. Xiang, H.-D. Tran, and T. T. Johnson, “Output Reachable Set Estimation and
Verification for Multilayer Neural Networks,” IEEE Transactions on Neural Networks
and Learning Systems, vol. 29, no. 11, pp. 5777–5783, 2018 (cit. on p. 197).
130. W. Xiang, H.-D. Tran, J. A. Rosenfeld, and T. T. Johnson, “Reachable Set Estimation and Safety Verification for Piecewise Linear Systems with Neural Network Controllers,” in American Control Conference (ACC), 2018 (cit. on p. 195).
131. D. Xu and Y. Tian, “A Comprehensive Survey of Clustering Algorithms,” Annals of
Data Science, vol. 2, pp. 165–193, 2015 (cit. on p. 249).

132. H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel, “Efficient Neural
Network Robustness Certification with General Activation Functions,” Advances in
Neural Information Processing Systems (NeurIPS), vol. 31, 2018 (cit. on p. 197).
133. Y.-D. Zhou, K.-T. Fang, and J.-H. Ning, “Mixture Discrepancy for Quasi-Random Point Sets,” Journal of Complexity, vol. 29, no. 3–4, pp. 283–301, 2013 (cit. on p. 75).
134. G. M. Ziegler, Lectures on Polytopes. Springer Science & Business Media, 2012, vol. 152
(cit. on p. 155).
135. A. Zutshi, J. V. Deshmukh, S. Sankaranarayanan, and J. Kapinski, “Multiple Shooting, CEGAR-Based Falsification for Hybrid Systems,” in International Conference on Embedded Software, 2014 (cit. on p. 67).

Index

H-polytope, 155
V-polytope, 155
k-means, 249

activation pattern, 196
Adaptive importance sampling, 119
adaptive stress testing, 82
admissible, 80
adversary, 84
alignment problem, 2, 225
atomic proposition, 30
average dispersion, 74
avoid set, 146

backward reachability, 201
biconditional, 31
black-box simulators, 61
Boolean satisfiability, 206
bounded model checking, 201
breadth-first search, 201
bridge density, 132
bridge sampling, 132
burn-in, 94
Büchi automaton, 42

clustering, 249
coefficient of variation, 142
coherent risk measure, 24
composite metric, 26
composite metrics, 26
Computation tree logic, 35
Concrete reachability, 184
concretize, 184
conditional value at risk, 24
conjunction, 31
conservative linearization, 180
consistent, 109
convex, 154
convex combinations, 155
convex hull, 155
convex set, 154
counterexample, 205
counterexample search, 205
counterexamples, 49
counterfactual, 243
counterfactual explanation, 243
coverage, 73
coverage metrics, 73
cross entropy, 119
cross entropy method, 119
CTL, see Computation tree logic
CVaR, see conditional value at risk

decision tree, 243
defect, 65
dependency effect, 150
direct methods, 61
discrepancy, 74
discrete state abstraction, 216
disjunction, 31
dispersion, 73
disturbance distribution, 52
disturbances, 49, 50
double progressive widening, 84

elite samples, 121
entropic value at risk, 24
environment, 8
episode, 84
existential quantifier, 33
expected value, 22
explainability, 226
explanation, 225
exploitation, 80
exploration, 80

failure region, 72
failure trajectories, 49
failures, 49
Falsification, 13
falsification, 49
falsifying trajectories, 49
feature, 227
feature importance, 227
first-order, 61
first-order logic, 31
forward reachability, 145, 201
fuzzing, 53
generators, 156
global explanation, 240

half space, 154
Hausdorff distance, 158
hyperrectangle, 157

implication, 31
Importance sampling, 113
inclusion function, 173
individuals, 63
integrated gradients, 234
interpretability, 226
interval arithmetic, 171
interval box, 171, see hyperrectangle
interval counterpart, 172
interval hull, 172
intervals, 157
invariant set, 153
iterative deepening, 205
iterative refinement, 162

kernel, 94

Lagrange remainder, 180
linear inequalities, 154
linear model, 240
linear program, 163
linear systems, 146
Linear temporal logic, 35
linearization, 176
local descent methods, 61
local explanation, 240
logic gates, 31
logical formula, 30
logical specification, 30
lower confidence bound (LCB), 82
LTL, see Linear temporal logic

marginal satisfaction, 252
Markov chain, 94
Markov chain Monte Carlo, 94
mean excess loss, 24
mean shortfall, 24
mean value theorem, 174
metrics, 21
Metropolis-Adjusted Langevin Algorithm, 99
Metropolis-Hastings, 94
Minkowski sum, 150
mixed-integer linear program, 189
Monte Carlo tree search (MCTS), 80
multiple importance sampling, 116
multiple shooting, 66

natural inclusion function, 174
negation, 31
neural network verification, 194
nonparametric, 129

overapproximation, 158
overapproximation error, 158

parametric signal temporal logic, 252
Pareto frontier, 26
Pareto optimal, 26
partially observable Markov decision process, 10
partitioning, 190
polyhedron, 154
polynomial zonotopes, 180
polytope, 154
polytopes, 154
population, 123
Population methods, 63
Population Monte Carlo, 123
positive spanning set, 161
predicate function, 31
predicates, 31
preference elicitation, 28
Probabilistic programming, 102
product system, 45
progressive widening, 82
proposal distribution, 90
proposition, 30
propositional logic, 30
prototypical example, 251
PSTL, see parametric signal temporal logic

quantifiers, 31

rapidly exploring random trees (RRT), 68
ratio importance sampling, 131
reachability specification, 41
reachable set, 146
rectified linear unit, 190
Reinforcement learning, 84
Rejection sampling, 90
risk metric, 23
robustness, 39
rollout, 10

safety case, 14
saliency maps, 231
sample efficiency, 85
SAT, see Boolean satisfiability
Satisfiability modulo theories, 208
second-order, 61
Self-normalized importance sampling, 133
sensitivity analysis, 229
sensor, 9
sequential Monte Carlo, 125
Set propagation, 147
Shapley value, 236
Shooting methods, 65
signal, 37
Signal temporal logic, 37
single shooting, 66
smooth robustness, 41
Smoothing, 97
SMT, see Satisfiability modulo theories
softmax, 41
softmin, 41

specification, 8, 11
specifications, 21
spurious correlations, 229
standard error, 108
Star discrepancy, 75
star sets, 180
stationary, 54
STL, see Signal temporal logic
support function, 159
support vector, 161
supremum, 73
surrogate model, 237
Swiss cheese model, 14
symbolic reachability, 184
system, 8

tail value at risk, 24
Taylor inclusion function, 176
Taylor models, 180
temporal logic, 33
testing, 1
trajectory, 10
truth table, 31

umbrella sampling, 131
unbiased, 109
unbounded model checking, 203
universal quantifier, 33
utopia point, 26

V model, 5
validation, 1
value at risk, 23
VaR, see value at risk
variable, 31
variance, 23
verification, 1

waterfall model, 4
weighted exponential sum, 28
weighted metrics, 26
weighted sum, 26
wrapping effect, 186

zonotope, 156
