Algorithms For Validation
Mykel J. Kochenderfer
Sydney M. Katz
Anthony L. Corso
Robert J. Moss
Stanford, California
© 2024 Kochenderfer, Katz, Corso, and Moss
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means
(including photocopying, recording or information storage and retrieval) without permission in writing from
the publisher.
This book was set in TeX Gyre Pagella by the authors in LaTeX.
Printed and bound in the United States of America.
ISBN:
10 9 8 7 6 5 4 3 2 1
To our families.
Contents
Preface

1 Introduction
1.1 Validation
1.2 History
1.3 Societal Consequences
1.4 Validation Algorithms
1.5 Challenges
1.6 Overview

2 System Modeling
2.1 Coming Soon

3 Property Specification
3.1 Properties of Systems
3.2 Metrics for Stochastic Systems
3.3 Composite Metrics
3.4 Logical Specifications
3.5 Temporal Logic
3.6 Reachability Specifications
3.7 Summary

4 Falsification through Optimization
4.1 Direct Sampling
4.2 Disturbances
4.3 Fuzzing
Preface
Mykel J. Kochenderfer
Sydney M. Katz
Anthony L. Corso
Robert J. Moss
Stanford, California
August 28, 2024
1 Introduction
1.1 Validation
For example, aircraft designers validate the structural integrity of the wings through
extensive stress testing, and medical device manufacturers validate the safety
of their devices through clinical trials. In this book, we present an algorithmic
perspective on validation and focus specifically on the validation of decision-
making agents.
Decision-making agents interact with the environment and make decisions based on the information they receive. These agents range from fully automated systems that operate independently within their environment to decision-support systems that inform human decision-makers.⁵ Examples include aircraft collision avoidance systems, adaptive cruise control systems, hiring assistants, disaster response systems, and other cyberphysical systems.⁶ While the algorithms presented in this book can be applied to many different types of decision-making agents, we place a particular emphasis on sequential decision-making agents, which make a series of decisions over time. For example, an autonomous vehicle must make a sequence of decisions to navigate from one location to another.

⁵ Autonomy and automation have different definitions in different communities. Autonomy is often defined as the automation of high-level tasks such as driving. The algorithms in this book can be applied to decision-making systems with any level of automation or autonomy.

⁶ Cyberphysical systems are computational systems that interact with the physical world.

1.2 History
[Figure 1.2: The waterfall and V models of the software development life cycle, with phases including requirements, system design, detailed design, implementation, unit tests, system validation, testing, deployment, and maintenance.]
During World War II, production volume increased to the point where it was no longer possible to inspect every product. This increase in production output led to the adoption of statistical quality control methods, which relied on sampling to speed up inspection. These ideas were developed by W. Edwards Deming⁹ (1900–1993) and Joseph M. Juran¹⁰ (1904–2008) and marked the beginning of the field of statistical process control. Deming and Juran introduced these ideas to Japanese manufacturers after World War II, which played a key role in the post-war economic recovery of Japan.

⁹ W. M. Tsutsui, "W. Edwards Deming and the Origins of Quality Control in Japan," Journal of Japanese Studies, vol. 22, no. 2, pp. 295–325, 1996.

¹⁰ D. Phillips-Donaldson, "100 Years of Juran," Quality Progress, vol. 37, no. 5, pp. 25–31, 2004.
The advancements in computing technology in the latter half of the 20th century increased our ability to use statistical methods to validate complex systems. In the late 1940s, scientists at Los Alamos National Laboratory developed the Monte Carlo method, which uses random sampling to solve complex mathematical problems.¹¹ These methods were later used to validate complex systems in a variety of domains such as aviation and finance. Progress in computing technology also led to new challenges in validation. The development of software systems required new validation techniques and best practices to ensure that the software operated correctly.

¹¹ A. F. Bielajew, "History of Monte Carlo," in Monte Carlo Techniques in Radiation Therapy, CRC Press, 2021, pp. 3–15.

In the 1970s, software engineers began formalizing the software development life cycle into phases that supported rigorous testing and validation. The waterfall model of software development, introduced in 1970, divided the software development process into distinct phases including requirements, design,
implementation, testing, and maintenance.¹² In the 1990s, the waterfall model was refined into the V model, which emphasizes the importance of testing and validation throughout the software development process.¹³ The V model aligns testing and validation activities with the corresponding development activities, ensuring that the system is validated at each stage of development. Figure 1.2 compares the waterfall and V models of the software development life cycle.

¹² W. W. Royce, "Managing the Development of Large Software Systems: Concepts and Techniques," IEEE WESCON, 1970.

¹³ K. Forsberg and H. Mooz, "The Relationship of System Engineering to the Project Cycle," Center for Systems Management, vol. 5333, 1991.

The 20th century also saw the emergence of regulatory bodies to guide the safe
model to account for validation of the learning process. In general, the validation
of AI systems is still an active area of research.
1.3.1 Safety
Validation is necessary for ensuring the safety of systems that interact with the physical world. Failures of safety-critical systems can result in catastrophic accidents that cause injury or loss of life. For example, unintended behavior of the safety-critical software used by the Therac-25 radiation therapy machine caused radiation overdoses that resulted in death or serious injury to six patients.²⁰ Safety is also important for transportation systems such as aircraft and cars. In 2002, a mid-air collision over Überlingen, Germany resulted in 71 fatalities when the traffic alert and collision avoidance system (TCAS) and air traffic control (ATC) systems issued conflicting instructions to the pilots.²¹ Furthermore, it is important to ensure that autonomous vehicles make safe decisions in a wide range of scenarios to prevent potential accidents. Since their introduction, autonomous vehicles have been involved in accidents that have resulted in injuries or fatalities.²²

²⁰ N. G. Leveson and C. S. Turner, "An Investigation of the Therac-25 Accidents," Computer, vol. 26, no. 7, pp. 18–41, 1993.

²¹ J. Kuchar and A. C. Drumm, "The Traffic Alert and Collision Avoidance System," Lincoln Laboratory Journal, vol. 16, no. 2, p. 277, 2007.

²² R. L. McCarthy, "Autonomous Vehicle Accident Data Analysis: California OL 316 Reports: 2015–2020," ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part B: Mechanical Engineering, vol. 8, no. 3, p. 034502, 2022.

1.3.2 Fairness
When agents make decisions that affect the lives of large groups of people, we must ensure that their decisions are fair and unbiased. Validation helps researchers and organizations identify and correct biases in decision-making systems before deployment. If these biases are not addressed, they can have serious consequences for individuals and society as a whole. For example, an automated hiring system developed by Amazon was ultimately discontinued after it was found to be biased against women due to biases in the historical data it was trained on.²³ In another case, a software system designed to predict recidivism rates in criminal defendants called COMPAS was found to be biased toward certain demographics based on empirical data.²⁴ Using the outputs of these systems to make decisions can result in the unfair treatment of individuals. Validating these systems before deployment can help prevent this type of failure.

²³ A. L. Hunkenschroer and A. Kriebitz, "Is AI Recruiting (Un)ethical? A Human Rights Perspective on the Use of AI for Hiring," AI and Ethics, vol. 3, no. 1, pp. 199–213, 2023.

²⁴ Other research has argued that the system is fair under a different definition of fairness. A detailed discussion is provided in J. Kleinberg, S. Mullainathan, and M. Raghavan, "Inherent Trade-Offs in the Fair Determination of Risk Scores," in Innovations in Theoretical Computer Science (ITCS) Conference, 2017.

1.3.3 Public Trust

Public trust in autonomous systems is critical for their widespread adoption, and validation plays a key role in developing this trust. For example, trust has been
1.4 Validation Algorithms

[Figure 1.3: A validation algorithm takes a system and a specification as inputs.]
Validation algorithms require two inputs, as shown in figure 1.3. The first input is
the system under test, which we will refer to as the system. The system represents
a decision-making agent operating in an environment. The agent makes decisions
based on information from the environment that it receives from sensors.³⁰ The second input is a specification, which expresses an operating requirement for the system. Specifications often pertain to safety, but they may also address other key design objectives. Given these inputs, validation algorithms output metrics to help us understand the scenarios in which the system does or does not satisfy the specification. The rest of this section provides a high-level overview of these inputs and outputs.

³⁰ Up to this point, we have informally used the term system to refer to only the agent and its sensors. For the remainder of the book, we will also include the operating environment as part of the system.
1.4.1 System
A system (algorithm 1.1) consists of three main components: an environment,
an agent, and a sensor. The environment represents the world in which the agent
operates. We refer to an agent’s configuration within its environment as its state s.
The state space S represents the set of all possible states. An environment consists
of an initial state distribution and a transition model. When the agent takes an
action, the state evolves probabilistically according to the transition model. The
transition model T(s′ | s, a) denotes the probability of transitioning to state s′ from state s when the agent takes action a.
For physical systems, the state often represents an agent's position and velocity in the environment, and the transition model is typically governed by the agent's equations of motion. Figure 1.4 shows an example of a state for an inverted pendulum system. The state and transition model may also contain information about other agents in the environment. For example, the environment for an aircraft collision avoidance system contains the other aircraft in the airspace that the agent must avoid. The other agents may also be human agents such as other drivers or pedestrians in the environment of an autonomous vehicle. The presence of other agents in the environment often increases our uncertainty in the outcome of a particular action.

[Figure 1.4: The state s of an inverted pendulum system can be compactly represented as its current angle from the vertical θ and its angular velocity ω, so that s = [θ, ω].]

In many real-world systems, agents do not have access to their true state within the environment and instead rely on observations from sensors. We define the sensor component of a system as a mechanism for sensing information about the environment. Many real-world systems rely on multiple sensors, so the sensor component may contain multiple sensing modalities. For example, an autonomous
component may contain multiple sensing modalities. For example, an autonomous
vehicle senses its position in the world using a combination of sensors such as
global positioning systems (GPS), cameras, and LiDAR. We model the sensor
component using an observation model O(o | s), which represents the probability
of producing observation o in state s. Observations come in multiple forms based
on the sensing modality. For example, GPS sensors output coordinates, while
camera sensors output image data. We call the set of all possible observations for
a system its observation space O.
An agent uses observations to select actions from a set of possible actions
known as the action space A. Agents may use a number of decision-making
algorithms or frameworks to select actions. While some agents select actions
based entirely on the observation, other agents use the observation to first estimate
the state and then select an action based on this estimate. Furthermore, some
[Figure 1.5: The components of a system. The sensor produces an observation o from the state s according to the observation model O.]
agents may keep track of previous actions and observations internally to improve
their state estimate. For example, an aircraft that only observes its altitude may
keep track of previous altitude measurements to estimate its climb or descent
rate. We abstract these behaviors of the agent using the notion of a policy π,
which is responsible for selecting an action given the current observation and
information the agent has stored previously. An agent’s policy can be stochastic
or deterministic. A stochastic policy samples actions according to a probability
distribution, while a deterministic policy will always produce the same action
given the same information.
The transition model T(s′ | s, a) satisfies the Markov assumption, which requires that the next state depend only on the current state and action. The state space, action space, observation space, observation model, and transition model are all elements of a sequential decision-making framework known as a partially observable Markov decision process (POMDP).³¹ Figure 1.5 demonstrates how these elements fit into the components of a system. Appendix A provides implementations of these components for the example systems discussed in this book.

³¹ M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.
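The component implementations themselves (algorithm 1.1) are not reproduced in this excerpt. As a rough sketch only, with type and field names that are assumptions rather than the book's implementation, the three components might be organized in Julia as:

struct Environment
    Ps  # initial state distribution
    T   # transition model: (s, a) -> distribution over next states s′
end

struct Sensor
    O   # observation model: s -> distribution over observations o
end

struct Agent
    π   # policy: o -> distribution over actions a
end

struct System
    env::Environment
    sensor::Sensor
    agent::Agent
end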
We analyze the behavior of a system over time by considering the sequence
of states, observations, and actions that the agent experiences. This sequence
is known as a trajectory. We generate trajectories by performing a rollout of the
system (algorithm 1.2). A rollout begins by sampling an initial state from the
initial state distribution associated with the environment. At each time step, the
sensor produces an observation based on the current state, the agent selects an
action based on the observation, and the environment transitions to a new state based on the action. We repeat this process to a desired depth d to generate a trajectory τ = (s₁, o₁, a₁, …, s_d, o_d, a_d) where s_{i+1} ∼ T(· | s_i, a_i), o_i ∼ O(· | s_i), and a_i ∼ π(· | o_i). Figure 1.6 shows an example trajectory for the inverted pendulum system.
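The book's algorithm 1.2 is referenced but not shown here. A minimal sketch of such a rollout, assuming the component types from the earlier sketch and that each model returns a sampleable distribution:

function rollout(sys::System, d)
    s = rand(sys.env.Ps)            # sample an initial state
    τ = []
    for t in 1:d
        o = rand(sys.sensor.O(s))   # o ~ O(· | s)
        a = rand(sys.agent.π(o))    # a ~ π(· | o)
        push!(τ, (; s, o, a))
        s = rand(sys.env.T(s, a))   # s′ ~ T(· | s, a)
    end
    return τ
end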
1.4.2 Specification
A specification ψ is a formal expression of a requirement that the system must
satisfy when deployed in the real world. These requirements may be derived from
domain knowledge or other systems engineering principles. Some industries
have regulatory agencies that govern requirements. These agencies are especially
common in safety-critical industries. For example, the FAA and the FDA in the
United States provide regulations and requirements for aircraft and healthcare
systems, respectively.
We express specifications by translating operating requirements to logical
formulas that can be evaluated on trajectories.³² For example, the specification for an aircraft collision avoidance system is that the agent should not collide with other aircraft in the airspace. Given a trajectory, we want to check whether any of the states in the trajectory represent a collision.

³² Chapter 3 discusses this process in detail.
Algorithm 1.3 defines a general framework for specifications that we will use
throughout this book. Evaluating a specification on a trajectory results in a Boolean
value that indicates whether the specification is satisfied. We consider a trajectory
to be a failure if the specification is not satisfied. Example 1.1 demonstrates this
idea on a simple grid world system. We can also derive higher-level metrics from
specifications such as the probability of failure or the expected cost of failure.
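The specification framework of algorithm 1.3 is not reproduced in this excerpt. As a rough sketch only (the names below are assumptions), a specification can be modeled as anything that maps a trajectory to a Boolean:

abstract type Specification end

struct AvoidSpecification <: Specification
    is_failure_state  # predicate evaluated on individual states
end

# The trajectory satisfies the specification if no visited state is a failure state.
evaluate(ψ::AvoidSpecification, τ) = all(!ψ.is_failure_state(step.s) for step in τ)

isfailure(ψ::Specification, τ) = !evaluate(ψ, τ)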
Example 1.1 (Example trajectories evaluated against a specification for the grid world system). In the grid world example shown on the right, the agent's goal is to navigate to the green goal state while avoiding the red obstacle state. Therefore, given a trajectory, the specification ψ will be satisfied if the trajectory contains the goal state and does not contain the obstacle state. The green trajectory in the figure satisfies the specification, while the red trajectory represents a failure. Chapter 3 will discuss how to express this specification as a logical formula.
[Figure 1.7 (panels: Falsification, Failure Distribution, Failure Probability): Failure analysis outputs for a simple system where failures occur to the left of the dashed red line with likelihood represented by the height of the black curve. The plot on the left shows a set of failure samples that could be identified through falsification. The plot in the middle highlights the shape of the failure distribution, and the shaded region in the plot on the right corresponds to the probability of failure.]
1.5 Challenges
• Cost and safety: Testing systems in the real world is expensive and can lead
to safety issues. For example, testing an aircraft collision avoidance system
involves operating aircraft in close proximity with one another for long periods
of time. For this reason, we often rely on simulation to test systems before
deploying them in the real world. We must be careful to ensure that the simu-
lated system accurately models the real-world system. However, capturing the
full complexity of the real world in simulation can result in simulators that are
computationally expensive to run.
catastrophic failures. Because these edge cases occur infrequently, they are
often difficult to identify.
1.6 Overview
This section outlines the remaining chapters of the book, which can be organized
into several categories:
2 System Modeling
3 Property Specification

Using the grid world specification, we could define a metric that measures the distance between the agent and the goal or obstacle.

We use metrics or specifications to evaluate individual trajectories, sets of trajectories, or probability distributions over trajectories. The miss distance between two aircraft can be used to measure the performance of an aircraft collision avoidance system in a single encounter scenario (figure 3.1), and the net return can be used to measure the performance of one outcome of a financial trading strategy over time. We can also create metrics or specifications that operate over a set of trajectories. For example, we can compute the average miss distance or net gain over a set of possible trajectories or specify a threshold on the number of trajectories that result in a collision. The remainder of this chapter discusses techniques to formally express metrics and specifications.

[Figure 3.1: Example of a metric for an aircraft collision avoidance system over an individual trajectory (top, miss distance) and over a set of trajectories (bottom, average miss distance); each panel plots relative altitude h (m) against time to collision t_col (s).]

3.2 Metrics for Stochastic Systems

For stochastic systems, we often compute metrics over the full distribution of trajectories. Given a function f(τ) that maps an individual trajectory τ to a real-valued metric, we are interested in summarizing the distribution over the output of f(τ) (figure 3.2). The remainder of this section outlines several metrics used to summarize distributions.
3.2.1 Expected Value
A common metric used to summarize a distribution is its expected value. The expected value represents the average output of a function given a distribution over its inputs. It is defined as

𝔼_{τ∼p(·)}[f(τ)] = ∫ f(τ) p(τ) dτ    (3.1)

where p(τ) is the probability distribution over trajectories. While it is not always possible to evaluate the expected value analytically, we can estimate it using a sampling-based approach.

[Figure 3.2: A distribution over miss distance (m) with the expected miss distance marked.]
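A standard sampling-based choice is the Monte Carlo estimate, which averages the metric over sampled rollouts. A minimal sketch, reusing the assumed rollout interface from the earlier sketches:

using Statistics: mean

# Estimate E[f(τ)] by averaging the metric over n sampled rollouts.
monte_carlo_estimate(f, sys, d, n) = mean(f(rollout(sys, d)) for _ in 1:n)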
specification is the probability that a randomly sampled trajectory will satisfy the specification. We could also derive a high-level specification from this probability by requiring that the probability of satisfying the specification is greater than a certain threshold.¹

¹ H. Hansson and B. Jonsson, "A Logic for Reasoning about Time and Reliability," Formal Aspects of Computing, vol. 6, pp. 512–535, 1994.

3.2.2 Variance
Another common summary metric is the variance, which measures the spread of the distribution. The variance of a metric f(τ) is defined as

Var_{τ∼p(·)}[f(τ)] = 𝔼_{τ∼p(·)}[(f(τ) − 𝔼_{τ∼p(·)}[f(τ)])²]    (3.2)

Intuitively, the variance measures how much the metric f(τ) deviates from its expected value. A low variance indicates that the metric tends to be consistent across different trajectories, while a high variance indicates that the metric varies significantly. It is important to consider both the expected value and variance of a metric when evaluating system performance (figure 3.3).
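Like the expected value, the variance can be estimated from samples. A minimal sketch under the same assumed interface:

using Statistics: var

# Estimate Var[f(τ)] from n sampled rollouts.
variance_estimate(f, sys, d, n) = var([f(rollout(sys, d)) for _ in 1:n])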
[Figure 3.4 (panels: α = 0.9, α = 0.7, α = 0.5, α = 0.3, α = 0.1; markers: expected value, VaR, CVaR): Effect of α on VaR and CVaR. Higher values for α correspond to more conservative risk estimates.]
3.3 Composite Metrics

In many real-world settings, we must select one of several system designs or strategies for final deployment, and metrics allow us to make an informed decision. For example, we might compare the performance of two aircraft collision avoidance systems by computing the probability of collision over a set of aircraft encounters for each system. In these cases, we are often concerned with multiple metrics. For example, an aircraft collision avoidance system should minimize collisions while issuing a small number of alerts to pilots, and a financial trading strategy may aim to maximize return while minimizing risk.

[Figure 3.5: Tradeoff between the alert rate and collision rate for an aircraft collision avoidance system. Each point represents a different system design.]

It is often the case that multiple metrics describing system performance are at odds with one another, and some system designs may perform well on one metric but poorly on another.
Example 3.1 (VaR and CVaR for the loss of separation metric for an aircraft collision avoidance system). Suppose a desired separation for the aircraft in an aircraft collision avoidance environment is 2,000 m. We can define a risk metric f(τ) to summarize the loss of separation as 2,000 m minus the miss distance. A higher loss of separation indicates higher risk. The plots below show the VaR and CVaR for the loss of separation metric for three different distributions over outcomes.

[Three distribution plots with the expected value, VaR, and CVaR marked.]

Although all three distributions have the same expected value, the VaR and CVaR decrease as we move from left to right. The distribution with the lowest VaR and CVaR is the least risky because it has better worst-case outcomes.
For instance, an aircraft collision avoidance system that minimizes the number of collisions may also increase the number of alerts issued to pilots, while one that minimizes alerts may increase the number of collisions (figure 3.5). In such cases, we can combine multiple metrics into a single composite metric that captures the trade-offs between different objectives.

We can compare systems with multiple metrics using the concept of Pareto optimality. A system design is Pareto optimal³ if we cannot improve one metric without worsening another. Given a set of system designs, the Pareto frontier consists of the subset of designs that are Pareto optimal. The Pareto frontier illustrates the trade-offs between metrics. Figure 3.6 shows the Pareto frontier for the aircraft collision avoidance systems shown in figure 3.5. Composite metrics allow system designers to select a single point on the Pareto frontier.

³ Pareto optimality is a topic that was originally explored in the field of economics. It is named after Italian economist Vilfredo Pareto (1848–1923).

[Figure 3.6: Pareto frontier for a set of aircraft collision avoidance system designs. The points that comprise the Pareto frontier are highlighted in blue.]

3.3.1 Weighted Metrics

Weighted metrics combine multiple metrics using a vector of weights that reflect the relative importance of each metric. Suppose we have a set of metrics f₁(τ), f₂(τ), …, f_n(τ) that we wish to combine into a single metric. The most basic weighted metric is the weighted sum, which is defined as

f(τ) = ∑_{i=1}^{n} wᵢ fᵢ(τ) = w⊤f(τ)    (3.4)
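As a rough illustration only (assumed helper names, not from the book), the Pareto frontier of a set of designs can be computed by filtering out dominated designs, where lower metric values are better:

# Design a dominates design b if it is no worse on every metric and strictly
# better on at least one.
dominates(a, b) = all(a .≤ b) && any(a .< b)

pareto_frontier(designs) =
    [d for d in designs if !any(dominates(other, d) for other in designs)]

# Each design is a vector of metrics, here [alert rate, collision rate].
designs = [[0.1, 0.9], [0.3, 0.5], [0.5, 0.4], [0.6, 0.6], [0.9, 0.1]]
frontier = pareto_frontier(designs)  # excludes the dominated design [0.6, 0.6]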
Example 3.2 (Using the weighted sum composite metric to select an aircraft collision avoidance system design along the Pareto frontier). Suppose we want to create a composite metric for an aircraft collision avoidance system that balances the alert rate and collision rate. Using the weighted sum method, we define the composite metric as the weighted sum of the alert rate and collision rate. Selecting a weight vector then allows us to choose a point on the Pareto frontier. The plots below show the Pareto frontier for two different weight vectors. The first weight vector (w₁ = [0.8, 0.2]) gives more weight to minimizing the alert rate, while the second weight vector (w₂ = [0.2, 0.8]) gives more weight to minimizing the collision rate. The points are colored according to the value of the composite metric.

[Two plots of collision rate against alert rate, one for each weight vector.]

The weight vector will be perpendicular to the Pareto frontier at the best design point. The weight vector w₁ is shown in blue for the first design point and w₂ is shown in blue for the second design point. The best design points are highlighted in green.
The weighted exponential sum is a composite metric that combines the weighted sum and goal metrics as follows:

f(τ) = ∑_{i=1}^{n} wᵢ (fᵢ(τ) − f_goal)ᵖ    (3.6)

them to select the preferred design. By repeating this process for multiple different pairwise queries of system designs, we can infer the weight that the expert assigns
where we assume that lower values for the composite metric are preferable.⁷ In effect, the response to the query further constrains the space of possible weight vectors (example 3.3).

⁷ If higher values are preferable, the inequality in equation (3.8) should be reversed.
Example 3.3 (The effect of a preference query on the space of possible weight vectors for the aircraft collision avoidance example). Suppose we want to infer the weights for a composite metric that combines the alert rate and collision rate for an aircraft collision avoidance system. When we query a domain expert or stakeholder with system designs f₁ = [0.8, 0.4] and f₂ = [0.4, 0.8], we find that the expert prefers f₁ to f₂. In other words, the expert prefers the system design with the higher alert rate and lower collision rate. Since the weight vector must be consistent with this preference (equation (3.8)), we can further constrain the space of possible weight vectors as shown in the figure below.

[Three plots over (w_alert, w_collision): the regions w⊤f₁ < w⊤f₂ and w⊤f₁ > w⊤f₂ separated by the line w⊤f₁ = w⊤f₂.]

The purple shaded region in the center plot shows the space of possible weight vectors consistent with the expert's preference. The plot on the right shows the space of possible weight vectors consistent with the expert's preference and the constraint that the weights must sum to 1. We can further refine the space of possible weight vectors by querying the expert with additional pairs of system designs.
By querying the expert with multiple pairs of system designs, we can iteratively refine the space of possible weight vectors (figure 3.9). To minimize the number of times we must query the expert, it is common to select pairs of system designs that maximally reduce the space of possible weights. For example, one method is to select the query that comes closest to bisecting the space of possible weights.⁸ After querying the expert a desired number of times, we can select a set of weights from the refined weight space to create a composite metric that reflects the expert's preferences. While we could select any value for w that is consistent with the expert's responses, it is common to select the weight vector that maximally separates the system designs that were presented to the expert.

⁸ This method is known as Q-Eval. V. S. Iyengar, J. Lee, and M. Campbell, "Q-EVAL: Evaluating Multiple Attribute Items Using Queries," in ACM Conference on Electronic Commerce, 2001.
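As a rough illustration only (assumed names, not the book's implementation), the weight region consistent with a set of preferences can be approximated by rejection sampling over the simplex, where a preference (f₁, f₂) means the expert prefers design f₁ (lower composite metric) to f₂:

using LinearAlgebra: dot

consistent(w, prefs) = all(dot(w, f₁) < dot(w, f₂) for (f₁, f₂) in prefs)

function feasible_weights(prefs; n=10_000)
    ws = Vector{Vector{Float64}}()
    for _ in 1:n
        w = rand(2); w ./= sum(w)          # random weights that sum to 1
        consistent(w, prefs) && push!(ws, w)
    end
    return ws
end

prefs = [([0.8, 0.4], [0.4, 0.8])]          # expert prefers f₁ = [0.8, 0.4]
ws = feasible_weights(prefs)                # here, the region w_collision > 0.5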
3.4 Logical Specifications
Example 3.4 (Constructing a propositional logic formula from a statement). Suppose we wish to express the following statement using propositional logic: "If the agent is in a safe state, then the agent is not in a collision state." Let the variable S represent whether the agent is in a safe state and C represent whether the agent is in a collision state. The propositional logic statement is S → ¬C (read as "S implies not C"). In this statement, S and C are atomic propositions because they cannot be broken down further. The logical formula S → ¬C is itself a proposition that can be combined with other propositions to create more complex formulas.
negation ("not") and conjunction ("and"). All other logical expressions such as disjunction ("or"), implication ("if-then"), and biconditional ("if and only if") can be constructed using negation and conjunction. Example 3.4 demonstrates the construction of a propositional logic formula from a statement.

Table 3.1 shows the propositional logic operators and their construction using negation and conjunction. We can describe propositional logic formulas using truth tables, which show the value of the formula as a function of its inputs. Figure 3.10 shows truth tables for each of the basic propositional logic operators. Logical operators can also be illustrated as logic gates (figure 3.11), which are fundamental building blocks for digital circuits.¹⁰ Example 3.5 implements the logical operators as functions in Julia.

¹⁰ R. Page and R. Gamboa, Essential Logic for Computer Science. MIT Press, 2019.
[Figure 3.10, in part: truth tables for disjunction and biconditional.]

P      Q      P ∨ Q        P      Q      P ↔ Q
false  false  false        false  false  true
false  true   true         false  true   false
true   false  true         true   false  false
true   true   true         true   true   true
[Figure 3.11: The operators ¬P, P ∧ Q, P ∨ Q, P → Q, and P ↔ Q represented using logic gates.]
Example 3.5 (Julia implementations of propositional logic operators). Consider two atomic propositions, P and Q. The basic operations of negation (!), conjunction (&&), and disjunction (||) are already implemented in most programming languages including Julia. Implication P → Q can be defined as the operator ⟶ given the Boolean values of P and Q:

julia> ⟶(P,Q) = !P || Q # \longrightarrow<TAB>
⟶ (generic function with 1 method)
julia> P = true;
julia> Q = false;
julia> P ⟶ Q
false
3.5 Temporal Logic

Temporal logic extends first-order logic to specify properties over time. It is particularly useful for specifying properties of dynamical systems because it allows us to describe how trajectories should evolve. This section outlines three common types of temporal logic.
Example 3.6 (Universal and existential quantifiers for an obstacle avoidance problem; the red region indicates an obstacle while the green region indicates the goal). Let x be a variable that represents the state of the agent in the grid world problem where we must avoid an obstacle (red), and define the domain X as the set of states that comprise a particular trajectory. We define a predicate function O(x) that evaluates to true if x is an obstacle state and false otherwise. To define a specification ψ₁ that states "for all states in the trajectory, the agent does not hit an obstacle," we can use the formula:

ψ₁ = ∀x ¬O(x)

[Two grid world trajectories: one with ψ₁ = true and one with ψ₁ = false.]

Suppose we also want the agent to reach a goal state while avoiding the obstacle. We can create an additional predicate G(x) that evaluates to true if x is a goal state and false otherwise. We then create ψ₂ to represent the statement "for all states in the trajectory, the agent does not hit an obstacle and there exists a state in the trajectory in which the agent reaches the goal" using the following formula:

ψ₂ = (∀x ¬O(x)) ∧ (∃x G(x))

[Two grid world trajectories: one with ψ₂ = true and one with ψ₂ = false.]
Example 3.7 (LTL formula for an obstacle avoidance problem where a blue checkpoint must be reached before the green goal while avoiding the red obstacle). For a navigation problem, let ψ be the LTL property specification that states "eventually reach the goal after passing through the checkpoint and always avoid the obstacle." First, we define the following predicate functions:

F(sₜ): the state s at time t contains an obstacle
G(sₜ): the state s at time t is the goal
C(sₜ): the state s at time t is the checkpoint

ψ = ◊G(sₜ) ∧ (¬G(sₜ) U C(sₜ)) ∧ □¬F(sₜ)

This formula requires that the agent reaches the goal (◊G(sₜ)) but that the goal is not reached until the checkpoint (¬G(sₜ) U C(sₜ)). Additionally, the agent must always avoid obstacles (□¬F(sₜ)). The figure in the caption shows an example trajectory that satisfies this specification. The following code constructs the LTL specification:

F = @formula sₜ -> sₜ == [5, 5]
G = @formula sₜ -> sₜ == [7, 8]
C = @formula sₜ -> sₜ == [8, 3]
ψ = LTLSpecification(@formula ◊(G) ∧ 𝒰(¬G, C) ∧ □(¬F))
where µ(·) is a real-valued function that operates on the state (example 3.8).
Table 3.3 defines the specifications for the continuum world, inverted pendulum,
and collision avoidance example problems using STL. Algorithm 3.2 provides a
framework for evaluating STL specifications over a trajectory given a time interval.
Example 3.8 (Julia implementation of an STL formula). Suppose we want to implement the following STL formula in code: "eventually the signal will be greater than 1." We can use the SignalTemporalLogic.jl package to define the predicate μ and the formula ψ as follows:

julia> using SignalTemporalLogic
julia> τ = [-1.0, -3.2, 2.0, 1.5, 3.0, 0.5, -0.5, -2.0, -4.0, -1.5];
julia> μ = @formula sₜ -> sₜ > 1.0;
julia> ψ = @formula ◊(μ);
julia> ψ(τ) # check if formula is satisfied
true

The formula is satisfied since the signal eventually becomes greater than 1.
Table 3.3. STL specifications for the example problems.

Continuum World: "Reach the goal without hitting the obstacle"
  G(sₜ): sₜ is in the goal region
  F(sₜ): sₜ is in the obstacle region
  ψ = ◊G(sₜ) ∧ □¬F(sₜ)
  G = @formula s->norm(s.-[6.5,7.5])≤0.5
  F = @formula s->norm(s.-[4.5,4.5])≤0.5
  ψ = @formula ◊(G) ∧ □(¬F)

Inverted Pendulum: "Keep the pendulum balanced"
  B(sₜ): |θₜ| ≤ π/4
  ψ = □B(sₜ)
  B = @formula s->abs(s[1])≤π/4
  ψ = @formula □(B)

Aircraft Collision Avoidance:
  S(sₜ): |hₜ| ≥ 50
  ψ = □_[40,41] S(sₜ)
  ψ = @formula □(40:41, S)
Example 3.9 (Robustness of the formulas ψ₁ = ◊μ₀ and ψ₂ = □μ₀ over a signal τ). Let μ₀(sₜ) be a predicate function that is true if sₜ is greater than 0. The following code computes the robustness of the formulas ◊μ₀ and □μ₀ over a signal τ:

julia> using SignalTemporalLogic
julia> τ = [-1.0, -3.2, 2.0, 1.5, 3.0, 0.5, -0.5, -2.0, -4.0, -1.5];
julia> μ = @formula sₜ -> sₜ > 0.0;
julia> ψ₁ = @formula ◊(μ);
julia> ρ₁ = ρ(τ, ψ₁)
3.0
julia> ψ₂ = @formula □(μ);
julia> ρ₂ = ρ(τ, ψ₂)
-4.0

The robustness of the formula ◊μ_c is the maximum difference between the signal and the threshold. We would have to decrease all of our signal values by at least this value to make the formula false. The robustness of the formula □μ_c is the minimum difference between the signal and the threshold. We would have to increase all of our signal values by at least the magnitude of this value to make the formula true. The figure in the caption shows signal values that determine the robustness for each formula.
operator because the signal must satisfy the property at all time steps in the
interval. Example 3.9 demonstrates this concept.
We can use the robustness metric to assess how close a given system trajectory is
to a failure. Furthermore, if we are able to compute the gradient of the robustness
metric with respect to certain inputs to the system, we can understand how these
inputs affect the overall safety of the system. We will use this idea throughout
the book to understand system behavior. For example, we can uncover the failure modes of a system by using the robustness metric to guide the simulator towards a failure trajectory (see chapter 4 for more details).
Taking the gradient of the robustness metric requires that the robustness formula is differentiable over the input space. However, the min and max functions that commonly occur in STL formulas are not differentiable everywhere. To address this challenge, we can use smooth approximations of the min and max functions, such as the softmin and softmax functions, respectively.¹⁷ These functions are defined as

softmin(s; w) = (∑_{i=1}^{d} sᵢ exp(−sᵢ/w)) / (∑_{j=1}^{d} exp(−sⱼ/w))    (3.18)

softmax(s; w) = (∑_{i=1}^{d} sᵢ exp(sᵢ/w)) / (∑_{j=1}^{d} exp(sⱼ/w))    (3.19)

where s is a signal of length d and w is a weight. As w approaches infinity, the softmin and softmax functions approach the mean function. As w approaches zero, the softmin and softmax functions approach the min and max functions (figure 3.13). We call the robustness metric that uses the softmin and softmax functions the smooth robustness metric. Figure 3.14 shows the gradient of the smooth robustness metric for different values of w.

¹⁷ K. Leung, N. Aréchiga, and M. Pavone, "Backpropagation Through Signal Temporal Logic Specifications: Infusing Logical Structure into Gradient-Based Methods," The International Journal of Robotics Research, vol. 42, no. 6, pp. 356–370, 2023.
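A minimal sketch of equations (3.18) and (3.19) in Julia, assuming the signal s is given as a vector and w > 0:

# Smooth approximations of min and max; small w sharpens the approximation
# toward min/max, while large w pushes both toward the mean.
softmin(s; w) = sum(s .* exp.(-s ./ w)) / sum(exp.(-s ./ w))
softmax(s; w) = sum(s .* exp.(s ./ w)) / sum(exp.(s ./ w))

s = [-1.0, 2.0, 0.5]
softmin(s; w=0.01)  # ≈ -1.0 (the minimum)
softmin(s; w=1e6)   # ≈ 0.5 (the mean)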
w=0 w=1 w=2 w=5 w = 20 Figure 3.14. The gradient of the ro-
bustness function for the formula
in example 3.9 with respect to the
signal values for different values of
w used in the smooth robustness
rr̃
Example 3.10 (Reachability specification for the inverted pendulum system). Let S_T be the set of states for the inverted pendulum system where the pendulum has tipped over. In other words, S_T is the set of states where the angle θ is outside the range [−π/4, π/4]. Our goal is to avoid reaching this set of states, so we define the reachability specification as

ψ = ¬◊R(sₜ)    (3.22)

where R(sₜ) is the predicate function that checks if the state is in the target set.
a state and an instantiation of truth values for the atomic propositions to the next
state. The accepting states are the states that are visited infinitely often when the
automaton accepts an infinite sequence of states. Example 3.11 shows a simple
Büchi automaton with two states and two propositions.
Example 3.11 (Example of a Büchi automaton with two states and two atomic propositions). The figure below shows a simple Büchi automaton that accepts an infinite sequence of states if the sequence satisfies the LTL formula A ∧ B.

[Automaton diagram: the start state q₁ has a self-loop labeled ¬(A ∧ B) and a transition labeled A ∧ B to the accepting state q₂, which has a self-loop labeled ⊤.]

The automaton has two states Q = {q₁, q₂}, where q₁ is the initial state and q₂ is the accepting state. The automaton has two atomic propositions A and B. The transition function is defined for all possible combinations of truth values for the atomic propositions:

δ(q₁, A ∧ B) = q₂
δ(q₁, A ∧ ¬B) = q₁
δ(q₁, ¬A ∧ B) = q₁
δ(q₁, ¬A ∧ ¬B) = q₁
δ(q₂, ·) = q₂
Example 3.12 (Conversion of an LTL formula to a Büchi automaton). Suppose we have an LTL formula that specifies that we need to visit a checkpoint before reaching a goal, written as

◊G ∧ (¬G U C)

The resulting automaton is shown below. It has 4 states and the same atomic propositions as the LTL formula. The accepting state is q₄, and the LTL formula is satisfied if the automaton visits q₄ infinitely often, or in other words, if a trajectory reaches q₄. The state q₂ represents the state where the agent has reached the goal but has not reached the checkpoint. Once this state has been reached, the agent will remain in this state forever with no chance of reaching the accepting state and satisfying the LTL formula. This state is often omitted in practice to reduce the size of the automaton.

[Automaton diagram: from the start state q₁, ¬C ∧ ¬G self-loops on q₁, ¬C ∧ G leads to the trap state q₂ (self-loop ⊤), C ∧ G leads to the accepting state q₄ (self-loop ⊤), and C ∧ ¬G leads to q₃; from q₃, G leads to q₄ and ¬G self-loops.]
The transition model for the new state space is defined by the transition model of the system T and the transition model of the Büchi automaton δ:

T((s′, q′) | (s, q), a) = T(s′ | s, a) if q′ = δ(q, L(s)), and 0 otherwise    (3.24)

where L(s) is a labeling function that maps a state s to values for the atomic propositions of the Büchi automaton. For example, a labeling function for the system in example 3.12 would map the state sₜ to values that specify whether it is a goal state or checkpoint state.

We refer to the system with the augmented state space as the product system. The reachability specification for the product system is

ψ = ◊R((sₜ, qₜ))    (3.25)
3.7 Summary
• Metrics and specifications allow us to quantify and express the desired behavior
of a system.
• For stochastic systems, we often compute metrics over the full distribution of
possible outcomes.
• Temporal logic extends first-order logic to express properties about how sys-
tems evolve over time.
• Linear temporal logic (LTL) and signal temporal logic (STL) are two common temporal logics used in control and verification.
4 Falsification through Optimization
The first set of validation algorithms we will explore relate to falsification. Fal-
sification is the process of finding trajectories of a system that violate a given
specification. Such trajectories are sometimes referred to as counterexamples, failure
trajectories, or falsifying trajectories. We will refer to them in this textbook as failures
for simplicity. The beginning of the chapter introduces a naïve algorithm for find-
ing failures based on direct sampling, with the rest of the chapter focused on more
sophisticated algorithms that use optimization techniques to guide the search
for failures. Optimization-based falsification relies on the concept of disturbances,
which control the behavior of the system. We demonstrate how to frame the
falsification problem as an optimization over disturbance trajectories and outline
several techniques to perform the optimization.
4.1 Direct Sampling

P(k) = (1 − p_fail)^{k−1} p_fail    (4.1)

where k ∈ ℕ.

Equation (4.1) corresponds to the probability mass function of a geometric distribution with parameter p_fail. Figure 4.2 shows an example of a geometric distribution. The expected value of this distribution, 1/p_fail, corresponds to the average number of samples required to find a failure. Example 4.1 illustrates this relationship for the aircraft collision avoidance problem. Systems with very low failure probabilities will require a large number of samples for direct falsification. For example, some aviation systems have failure probabilities on the order of 10⁻⁹. These systems require 1 billion samples on average to observe a single failure event. The remainder of the chapter discusses more efficient falsification techniques.

[Figure 4.2: The probability mass function of a geometric distribution with parameter p_fail = 0.2. The expected value of this distribution is 1/p_fail = 5.]
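As a rough illustration only (assumed names, reusing the earlier sketches), direct falsification repeatedly samples rollouts until one violates the specification:

function direct_falsification(ψ, sys, d; max_samples=100_000)
    for k in 1:max_samples
        τ = rollout(sys, d)
        isfailure(ψ, τ) && return (τ, k)  # first failure found on sample k
    end
    return nothing                         # no failure within the budget
end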
4.2 Disturbances
We can systematically search for failures by taking control of the sources of ran-
domness in the system. We control these sources of randomness using disturbances.
To incorporate disturbances into a system, we rewrite its sensor, agent, and en-
vironment models by breaking up their stochastic and deterministic elements.
For example, the observation model o ∼ O(· | s) can be written as a deterministic function of the current state s and a stochastic disturbance x_o such that

o = O(s, x_o),  x_o ∼ D_o(· | s)    (4.2)
Example 4.1 (Direct falsification applied to the aircraft collision avoidance problem with different levels of noise applied to the transitions). Suppose we want to find failures of an aircraft collision avoidance system using direct falsification. In this scenario, a failure is a collision between two aircraft, which occurs when the relative altitude to the intruder aircraft h is within ±50 m and the time to collision t_col is zero. The collision avoidance environment applies additive noise with standard deviation σ to the relative vertical rate of the intruder aircraft dh at each time step. This noise accounts for variation in pilot response to advisories and the intruder flight path. The plots below use different values of σ and show the trajectory samples produced before finding the first failure with the first failure trajectory highlighted in red.

[Three panels for σ = 5 m, σ = 3 m, and σ = 2 m. There are four state variables for the collision avoidance problem; these plots show how two of them evolve for each trajectory. The horizontal axis is the time to collision t_col (s), and the vertical axis is the altitude relative to the intruder aircraft h (m). Appendix E.1 provides additional details about this problem.]

As σ decreases, failures become less likely, and more trajectories are required to find a failure. In this example, the first failure is found after 41 samples with σ = 5 m, 84 samples with σ = 3 m, and 522 samples with σ = 2 m.
Example 4.2 (Separating the stochastic and deterministic elements of a sensor with a Gaussian noise model). Suppose we model a sensor using a Gaussian noise model such that O(o | s) = N(o | s, Σ). We can rewrite this sensor model as

o = s + x_o,  x_o ∼ N(· | 0, Σ)

The agent's policy and the environment's transition model can also be decomposed:

a = π(o, x_a),  x_a ∼ D_a(· | s)    (4.3)
s′ = T(s, a, x_s),  x_s ∼ D_s(· | s, a)    (4.4)
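A minimal sketch of example 4.2's sensor decomposition in Julia (the names below are assumptions for illustration):

using Distributions, LinearAlgebra

Σ = 0.01 * Matrix(I, 2, 2)
Do(s) = MvNormal(zeros(2), Σ)    # disturbance distribution x_o ~ N(· | 0, Σ)
observation(s, xo) = s + xo      # deterministic sensor function o = O(s, x_o)

s = [1.0, 2.0]
o = observation(s, rand(Do(s)))  # equivalent to sampling o ~ N(· | s, Σ)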
4.3 Fuzzing
Unlike direct sampling, which samples from the nominal distribution over system trajectories, we can find failures more efficiently by sampling from a trajectory distribution designed to stress the system. We refer to this process as fuzzing.² Before we can perform fuzzing, we need to define the components of a trajectory distribution. There are two sources of randomness in a trajectory rollout: the initial state and the disturbances applied at each time step. Therefore, we can fully capture the distribution over trajectories by specifying an initial state distribution and a disturbance distribution for each time step (algorithm 4.3).

² Fuzzing is a well-known concept in testing of traditional software. It refers to the generation of off-nominal inputs to a program to uncover potential bugs or failures and was first introduced in B. P. Miller, L. Fredriksen, and B. So, "An Empirical Study of the Reliability of UNIX Utilities," Communications of the ACM, vol. 33, no. 12, pp. 32–44, 1990.
uses the default initial state and disturbance distributions for the components
of the system. Nominal trajectory distributions are stationary, meaning that the
disturbance distribution does not depend on time.
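The struct definition itself does not appear in this excerpt; a minimal sketch consistent with the accessors below (field names inferred from their use) might be:

struct NominalTrajectoryDistribution
    Ps  # initial state distribution
    D   # disturbance distribution (time-invariant)
    d   # rollout depth
end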
initial_state_distribution(p::NominalTrajectoryDistribution) = p.Ps
disturbance_distribution(p::NominalTrajectoryDistribution, t) = p.D
depth(p::NominalTrajectoryDistribution) = p.d
Example 4.3. Fuzzing applied to the inverted pendulum system. The plots in the top row show the sampled disturbances for the sensor noise on each state variable. The plots on the bottom row show the corresponding trajectories for θ with failures highlighted in red. By slightly increasing the standard deviation of the simulated sensor noise, we are able to uncover two failures.

Suppose we want to find failures of the inverted pendulum system with an additive noise sensor with Dₒ(o | s) = N(o | 0, Σ) and Σ = 0.01I. If we collect 100 samples with this nominal distribution, we do not find any failures. However, if we define a new distribution and increase the standard deviation of the sensor noise on each variable from 0.1 to 0.15 (referred to as fuzzing), we are able to find two failures of the system in the first 100 samples. The following code can be used to define the fuzzing distribution:
struct PendulumFuzzingDistribution <: TrajectoryDistribution deviation of the simulated sensor
Σₒ # sensor disturbance covariance noise, we are able to uncover two
d # depth failures.
end
function initial_state_distribution(p::PendulumFuzzingDistribution)
return Product([Uniform(-π / 16, π / 16), Uniform(-1., 1.)])
end
function disturbance_distribution(p::PendulumFuzzingDistribution, t)
    D = DisturbanceDistribution((o)->Deterministic(),      # no agent disturbance
                                (s,a)->Deterministic(),    # no transition disturbance
                                (s)->MvNormal(zeros(2), p.Σₒ)) # Gaussian sensor disturbance
    return D
end
depth(p::PendulumFuzzingDistribution) = p.d
The plots show the disturbances and trajectories for both distributions.

[Plots: sampled sensor disturbances (xₒ,ω versus xₒ,θ) and pendulum trajectories (θ in rad versus time in s) for the nominal and fuzzing distributions.]
The falsification problem can be reformulated as a search over the space of initial states and disturbances. Algorithm 4.6 performs a trajectory rollout given an initial state and a sequence of disturbances. We refer to this sequence of disturbances as a disturbance trajectory x = (x₁, …, x_d). Unlike algorithm 4.5, algorithm 4.6 is deterministic. The initial state s and disturbance trajectory x fully determine the resulting trajectory τ.

    minimize_{s, x}   f(τ)
    subject to        τ = Rollout(s, x)        (4.5)
The rest of this chapter discusses different objective functions and optimization
techniques for solving the optimization problem in equation (4.5).
4.5 Objective Functions
Objective functions guide the search for failure trajectories. In general, a good
objective function should output lower values for trajectories that are closer to a
failure. The specific measure of closeness used is dependent on the application.
For example, in the aircraft collision avoidance problem, we may use the vertical
miss distance between the aircraft as the objective value.
Example 4.4. Extracting an initial state and disturbance trajectory from a vector of real values for the inverted pendulum system.

Suppose we want to compute the robustness objective for the inverted pendulum system where the initial state is always s = [0, 0]. We write the extract function as follows:
function extract(env::InvertedPendulum, x)
s = [0.0, 0.0]
𝐱 = [Disturbance(0, 0, x[i:i+1]) for i in 1:2:length(x)]
return s, 𝐱
end
The function extracts the sensor disturbances from the real-valued vector x
to create a disturbance trajectory 𝐱. It then returns the fixed initial state s and
the disturbance trajectory.
Searching for the most likely failure requires specifying the distribution over trajectories and using its probability density function to evaluate likelihoods. Assuming that the initial state and disturbances are sampled independently from one another, the probability density function of a trajectory distribution p is

    p(τ) = p(s₁) ∏_{i=1}^{d} D(xᵢ | sᵢ, aᵢ, oᵢ)        (4.6)
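As a concrete instance of equation (4.6), the sketch below evaluates the density of a trajectory under the pendulum fuzzing distribution of example 4.3, where the only stochastic component is the Gaussian sensor disturbance. It assumes each step of the trajectory τ stores its state s and disturbance x; these field names are assumptions.

using Distributions

# A sketch of equation (4.6) for the pendulum fuzzing distribution:
# the density of the initial state times the density of each sampled
# sensor disturbance. The step fields (s, x) are hypothetical names.
function trajectory_pdf(p::PendulumFuzzingDistribution, τ)
    density = pdf(initial_state_distribution(p), τ[1].s)
    for step in τ
        density *= pdf(MvNormal(zeros(2), p.Σₒ), step.x)
    end
    return density
end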
If the input trajectory does not produce a failure, equation (4.7) uses the robustness to guide the search toward any failure. If the input does produce a failure trajectory, it uses the negative likelihood of the trajectory to guide the search toward more likely failures. Figure 4.3 compares a search for failures with a search for the most likely failure on the grid world problem. While the robustness objective finds failures that move directly toward the obstacle, the most likely failure objective finds a failure that stays close to the nominal path.

The objective function in equation (4.7) leads to multiple practical challenges. For example, to encourage the optimization algorithm to find failures, we must ensure that failures never have a higher objective value than successes. Since ρ(τ, ψ) ≥ 0 and p(τ) ≥ 0, equation (4.7) satisfies this condition. However, p(τ) can be very small for long trajectories, which can lead to numerical stability issues. Using the log likelihood improves numerical stability but breaks the condition that failures never have a higher objective value than successes.
This numerical instability as well as the discontinuity at the point of a failure creates challenges for first- and second-order optimization algorithms (section 4.6).
Furthermore, while the global minimum of the objective function in equation (4.7)
corresponds to the most likely failure of the system, many optimizers are only
guaranteed to find local minima. Due to this fact and the numerical stability
issues, other objective functions may lead to the discovery of more likely failures
in practice.
Another common objective for most likely failure analysis is
4.6 Optimization Algorithms

Local descent methods start from an initial design point and incrementally improve it until some convergence criterion is met. At each iteration, they use a local model of the objective function at the current design point to determine a direction of improvement. They then take a step in this direction to compute the next design point. Some methods use the gradient or Hessian of the objective function with respect to the current design point to create the local model. These methods are called first-order and second-order methods, respectively. Figure 4.4 shows the result of applying a first-order method called gradient descent to find failures for the inverted pendulum example.

Figure 4.4. First-order method applied to falsify the inverted pendulum example. The plot shows successive iterations of the algorithm, with darker trajectories indicating later iterations. Failures are highlighted in red. The algorithm gets closer to a failure with each iteration until it eventually begins to find failures.

While the gradient and Hessian provide a very powerful signal for optimization algorithms, they are not always available.⁶ Some simulators do not provide access to the internal model of the system, making exact computation of the gradient infeasible. We often refer to such simulators as black-box simulators. Another category of optimization algorithms called direct methods is better suited for systems with black-box simulators. They traverse the input space using only information from function evaluations, eliminating the need for access to the system's internal model.

⁶ Gradient information is a strong enough signal to effectively optimize machine learning models with billions of parameters.
Example 4.5. Applying a second-order method called L-BFGS to falsify the inverted pendulum example. We use the open-source implementation of L-BFGS in the Optim.jl package. The plot shows the trajectory of the pendulum for the initial point (green) and the failure trajectory discovered after one iteration (red). For more information on the L-BFGS algorithm, see J. Nocedal, "Updating Quasi-Newton Matrices with Limited Storage," Mathematics of Computation, vol. 35, no. 151, pp. 773–782, 1980.

The Optim.jl package provides implementations of several optimization algorithms. In this example, we show how to use the Optim.jl implementation of a second-order method called L-BFGS to falsify the inverted pendulum system. We define the optimizer function for algorithm 4.11 and run the algorithm using the robustness objective as follows:

using Optim

function lbfgs(f, sys, ψ)
    x₀ = zeros(42)
    alg = Optim.LBFGS()
    options = Optim.Options(store_trace=true, extended_trace=true)
    results = optimize(f, x₀, alg, options; autodiff=:forward)
    τs = [rollout(sys, extract(sys.env, iter.metadata["x"])...)
          for iter in results.trace]
    return filter(τ->isfailure(ψ, τ), τs)
end

objective(x, sys, ψ) = robustness_objective(x, sys, ψ, smoothed=true)
alg = OptimizationBasedFalsification(objective, lbfgs)
failures = falsify(alg, inverted_pendulum, ψ)

In this implementation, we are optimizing over a disturbance trajectory with depth d = 21. Since each sensor disturbance is two-dimensional, the length of each design point is 42. The lbfgs function starts with an initial design point of all zeros, specifies options to store the results of each iteration, and runs the algorithm using ForwardDiff.jl to compute gradients. It then extracts the initial state and disturbance trajectory from each iteration and performs a rollout of the system. Finally, it filters the resulting trajectories to return failure trajectories. It is important that we specify the objective as smoothed robustness so that the gradients are well-defined. The plot shows the progression of the algorithm: L-BFGS converges to a failure trajectory after a single iteration.
Local descent methods often get stuck in local optima. Population methods at-
tempt to overcome this drawback by performing optimization using a collection of
design points. The points in a population are sometimes referred to as individuals.
Population methods begin with an initial population that is spread out over the
design space. At each iteration, they use the current function value of each indi-
vidual to move the population toward the optimum. Because population methods
spread samples over the entire design space rather than incrementally improving
a single point, they may find a more diverse set of failures. For example, the
population method in figure 4.5 is able to find failures for the pendulum in both
directions. High-dimensional problems with long time horizons may require a
large number of samples to cover the design space. However, population methods
are often easy to parallelize, which can improve efficiency.
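As one concrete example of a population method, the sketch below performs a single cross-entropy-style update over disturbance vectors: sample a population around the current mean, keep the elite individuals with the lowest objective values, and refit the sampling distribution. The function and parameter names are illustrative, not the book's implementation.

using Statistics

# A minimal sketch of one iteration of a cross-entropy-style population
# method: μ and σ parameterize an elementwise Gaussian over design points.
function population_step(f, μ, σ; m=100, m_elite=10)
    population = [μ .+ σ .* randn(length(μ)) for _ in 1:m]
    elite = sort(population, by=f)[1:m_elite]         # lowest objective values
    μ′ = mean(elite)                                  # elementwise mean
    σ′ = sqrt.(mean([(x .- μ′) .^ 2 for x in elite])) # elementwise std
    return μ′, σ′
end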
4.7 Summary
5 Falsification through Planning
Shooting methods use optimization to find a feasible path between two points,¹ and they can be used in the context of falsification to produce feasible failure trajectories. These methods break the trajectory optimization problem into a set of smaller problems by optimizing over a sequence of trajectory segments. A trajectory τ can be partitioned into n segments such that τ = (τ₁, …, τₙ). Each trajectory segment τᵢ is defined by an initial state sᵢ and a sequence of disturbances xᵢ of length dᵢ. Given sᵢ and xᵢ, we can compute the resulting trajectory segment τᵢ by performing a rollout.

¹ The term shooting method is based on the analogy of shooting at a target from a cannon. Shooting methods start at an initial point and "shoot" trajectories toward a target point until a feasible path between the initial point and target is found. Shooting methods originated from research on boundary value problems. A more detailed review with an implementation can be found in section 18.1 of the reference by W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.

The defect between two trajectory segments is the distance between the final state of the first segment and the initial state of the second segment. A set of trajectory segments forms a feasible trajectory if the defect of all consecutive trajectory segments is 0. In other words, the final state of τᵢ must match the initial state of τᵢ₊₁ for all i ∈ {1, …, n − 1}. This requirement leads to the following
optimization problem:

    minimize_{s₁, x₁, …, sₙ, xₙ}   f(τ₁, …, τₙ)
Figure 5.1. Multiple shooting applied to the continuum world example to find a path from an initial point to the obstacle. We use four trajectory segments, and the colors denote which segment end points should connect. The plots show the trajectory segments at different iterations of the L-BFGS optimization algorithm. (Panels: Iteration 2, Iteration 4, Iteration 6, Iteration 8, Converged.)
where λ is a weighting parameter that controls how heavily the defect is penalized. Algorithm 5.1 implements this objective when f is the temporal logic robustness.
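A minimal sketch of such a penalized objective is shown below, standing in for algorithm 5.1 (not shown in this excerpt). It assumes each segment stores its initial state s and disturbance sequence x, and that a helper final_state returns the last state of a rolled-out segment; these names are assumptions.

using LinearAlgebra

# A sketch of a multiple-shooting objective: evaluate f on the rolled-out
# segments and add λ times the total defect between consecutive segments.
function shooting_objective(f, segments, sys; λ=1.0)
    τs = [rollout(sys, seg.s, seg.x) for seg in segments]
    defect = sum(norm(final_state(τs[i]) .- segments[i+1].s)
                 for i in 1:length(segments)-1)
    return f(τs...) + λ * defect
end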
We can apply any of the optimization algorithms discussed in appendix C to the optimization problem in equation (5.2). Compared to the optimization problems in the previous chapter, minimizing the defect between the trajectory segments adds complexity to the problem. This added complexity can make it more difficult to find a feasible failure trajectory. Figure 5.1 shows an example that uses a gradient-based optimization technique called L-BFGS² to find a failure trajectory for the continuum world problem. For systems with black-box simulators, the direct methods described in appendix C may struggle to find feasible failure trajectories. Instead, we can use direct methods that were designed specifically for multiple shooting.³

² J. Nocedal, "Updating Quasi-Newton Matrices with Limited Storage," Mathematics of Computation, vol. 35, no. 151, pp. 773–782, 1980.

³ For an example of a multiple shooting algorithm designed for systems with black-box simulators, see A. Zutshi, J. V. Deshmukh, S. Sankaranarayanan, and J. Kapinski, "Multiple Shooting, CEGAR-Based Falsification for Hybrid Systems," in International Conference on Embedded Software, 2014.

5.2 Tree Search

Tree search algorithms iteratively construct a tree structure that represents the space of possible trajectories. Each node in the tree represents a state, and each
edge represents a transition between states that is the result of applying a partic-
ular disturbance. Each path through the tree corresponds to a feasible trajectory
for the system. Tree search algorithms start in an initial state and iteratively grow
the tree in an attempt to find feasible failure trajectories.
At each iteration, these algorithms perform the steps illustrated in figure 5.2.
They first select a node from the tree to extend. This selection is typically based
on a heuristic designed to grow the tree toward failures. Next, they extend the
selected node by choosing a disturbance and adding a new child node at the
resulting next state. We can terminate the algorithm after a fixed number of
iterations or when a failure trajectory is discovered.
Algorithm 5.2 implements the generic tree search algorithm. It runs for a fixed
number of iterations before returning all failures in the tree. Algorithm 5.3 extracts
failure trajectories from a tree by enumerating all paths in the tree and checking
for failures. Specific implementations of tree search algorithms differ in how they
implement the select and extend functions. We discuss two categories of tree
search algorithms in the next two sections.
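The generic loop might be sketched as follows, with a simple node type and pluggable select and extend functions; this is an assumption-laden stand-in for the book's algorithm 5.2, not its actual implementation.

# A minimal sketch of generic tree search: repeatedly select a node,
# extend it with a disturbance, and record the resulting child.
struct TreeNode
    state        # the state this node represents
    parent       # parent node (nothing for the root)
    disturbance  # disturbance applied on the edge from the parent
end

function tree_search(sys, s₀, select, extend, k_max)
    tree = [TreeNode(s₀, nothing, nothing)]
    for k in 1:k_max
        node = select(tree)        # heuristic choice of node to grow
        x, s′ = extend(sys, node)  # choose a disturbance, simulate one step
        push!(tree, TreeNode(s′, node, x))
    end
    return tree
end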
Some tree search algorithms use heuristics to explore the space of possible tra-
jectories. The rapidly exploring random trees (RRT) algorithm, for example, uses
heuristics to iteratively extend the search tree toward randomly selected states in
the state space.⁴ In the context of falsification, we use RRT to efficiently explore the space of possible disturbance trajectories in search of a failure trajectory.

⁴ RRT was designed to efficiently enumerate trajectories in high-dimensional spaces, particularly for systems with complex dynamics. The algorithm was originally proposed in the context of robotic path planning. For more information on path planning algorithms, see S. M. LaValle, Planning Algorithms. Cambridge University Press, 2006.

Algorithm 5.4 implements the select and extend steps for the RRT algorithm. In the select step, RRT randomly samples a goal state and computes an objective value for each node in the current tree based on the sampled goal state. This objective is typically related to the distance between each node and the goal state. The algorithm then selects the node with the lowest objective value to pass to the extend step. In the extend step, RRT selects a disturbance, simulates one step forward in time from the selected node, and adds the resulting edge and child node to the tree.
Several variants of RRT differ in how they sample goal states, compute objectives, and select disturbances. Algorithm 5.5 implements a version of the RRT algorithm that samples goal states uniformly from the state space. It then uses the Euclidean distance between each node and the goal state as the objective. In the extend step, the disturbance is randomly sampled from the nominal disturbance distribution for the system. Example 5.1 applies this algorithm to the continuum world problem.
Algorithm 5.5. Functions for the RRT algorithm. The first function samples a goal state uniformly from the state space. The lo and hi inputs specify the lower and upper bounds of the state variables. The second function computes the Euclidean distance between each node in the tree and the goal state. The third function samples a disturbance from the nominal disturbance distribution for the system.

random_goal(tree, lo, hi) = rand.(Distributions.Uniform.(lo, hi))

function distance_objectives(tree, sgoal)
    return [norm(sgoal .- node.state) for node in tree]
end

function random_disturbance(sys, node)
    D = DisturbanceDistribution(sys)
    o, a, s′, x = step(sys, node.state, D)
    return x
end
Example 5.1. Basic RRT applied to the continuum world example. The plots show snapshots of the search tree after 5, 15, and 100 iterations. The stars show the next goal state, and highlighted nodes show the node selected to extend next.

Suppose we want to apply RRT to search for failures for the continuum world system. We can use the following code to run the basic RRT algorithm for 100 iterations.

select_goal(tree) = random_goal(tree, [0.0, 0.0], [10.0, 10.0])
compute_objectives(tree, sgoal) = distance_objectives(tree, sgoal)
select_disturbance(sys, node) = random_disturbance(sys, node) # pass the system, matching random_disturbance(sys, node)
alg = RRT(select_goal, compute_objectives, select_disturbance, 100)
failures = falsify(alg, cw, ψ)

The plots below show two snapshots of the search tree after 5 and 15 iterations as well as the final tree after 100 iterations. After 100 iterations, RRT did not find any failure trajectories. Although goal states are sampled throughout the state space, the disturbances are sampled from the nominal disturbance distribution. Since the nominal disturbance distribution represents only small deviations from the nominal path, the tree closely follows the nominal path toward the goal. We can improve the performance of the tree search using the heuristics discussed in section 5.3.1.
using the goal state to select the disturbance. Specifically, we want to select the
disturbance that leads to the next state that is closest to the goal state.
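A minimal sketch of this idea is shown below: sample several candidate disturbances from the nominal distribution, simulate one step for each, and keep the disturbance whose next state is closest to the goal. The helper names follow the conventions of algorithm 5.5 but are assumptions.

using LinearAlgebra

# Goal-directed disturbance selection: among m sampled candidates, return
# the disturbance whose resulting next state is nearest the goal state.
function goal_directed_disturbance(sys, node, sgoal; m=10)
    D = DisturbanceDistribution(sys)
    best_x, best_dist = nothing, Inf
    for i in 1:m
        o, a, s′, x = step(sys, node.state, D)
        dist = norm(sgoal .- s′)
        if dist < best_dist
            best_x, best_dist = x, dist
        end
    end
    return best_x
end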
Example 5.2. Example of sampling goal states from the failure region of the continuum world problem.

The failure region for the continuum world example is the set of states within the red obstacle, which is a circle centered at (4.5, 4.5) with radius 0.5. We can sample from this region using the following code:

function failure_goal(tree)
    r = rand(Uniform(0, 0.5))
    θ = rand(Uniform(0, 2π))
    return [4.5, 4.5] .+ [r*cos(θ), r*sin(θ)]
end

The code samples a radius between 0 and 0.5 and an angle between 0 and 2π. It then converts these polar samples to a state in the failure region. Note that sampling the radius uniformly concentrates samples near the center of the circle; a sample that is uniform over the area of the disk would instead use r = 0.5√u with u ∼ Uniform(0, 1).
Since dispersion considers only the largest ball that can be placed in 𝒮, it tends to be a conservative measure of coverage. Furthermore, it is difficult to compute for high-dimensional spaces. An approximate metric called average dispersion overcomes these drawbacks.⁶ Average dispersion is computed on a grid of n points with spacing δ in each dimension. It is calculated as

    average dispersion = (1/n) ∑_{j=1}^{n} min(d_j(𝒱), δ)/δ        (5.4)

where d_j(𝒱) is the distance from the jth grid point to the nearest point in 𝒱.

⁶ J. M. Esposito, J. Kim, and V. Kumar, "Adaptive RRTs for Validating Hybrid Robotic Control Systems," in Algorithmic Foundations of Robotics, Springer, 2005, pp. 107–121.
Algorithm 5.7. Algorithm for computing the average dispersion of a set of points on a space bounded by lo and hi. It uses a grid specified by lengths, which contains the number of grid points in each dimension. The algorithm first normalizes the points to lie in the unit hypercube. It then creates the grid over the unit hypercube and computes the average dispersion using equation (5.4).

function average_dispersion(points, lo, hi, lengths)
    points_norm = [(point .- lo) ./ (hi .- lo) for point in points]
    ranges = [range(0, 1, length) for length in lengths]
    δ = minimum(Float64(r.step) for r in ranges)
    grid_dispersions = []
    for grid_point in Iterators.product(ranges...)
        dmin = minimum(norm(grid_point .- p) for p in points_norm)
        push!(grid_dispersions, min(dmin, δ) / δ)
    end
    return mean(grid_dispersions)
end
Algorithm 5.7 computes average dispersion given a set of points and a bounded region. The term in the numerator of equation (5.4) is the radius of the largest ball centered at each grid point that does not contain any points in 𝒱 or other grid points. Dividing by δ ensures that the values for average dispersion range between 0 and 1, and subtracting the average dispersion from 1 results in a coverage metric that ranges between 0 and 1.⁷ Figure 5.6 shows the difference between dispersion and average dispersion.

⁷ The average dispersion coverage metric will be 1 if 𝒱 contains all of the grid points. A finer grid will result in better coverage estimates but at a greater computational cost.
Another common coverage metric is discrepancy. The key insight behind discrepancy is that if a set of points covers a space evenly, then a randomly chosen subset of the space should contain a fraction of samples proportional to the fraction of volume occupied by the subset. Discrepancy is defined in terms of the worst-case hyperrectangular subset:

    discrepancy = sup_{H ⊆ 𝒮} |#(𝒱 ∩ H)/#(𝒱) − vol(H)/vol(𝒮)|        (5.5)
where H is a hyperrectangular subset of 𝒮, and #(𝒱 ∩ H) and #(𝒱) are the number of points in 𝒱 that lie in H and the total number of points in 𝒱, respectively. We use vol(H) and vol(𝒮) to denote the n-dimensional volume of H and 𝒮, respectively, which can be obtained by multiplying the side lengths.

The worst-case hyperrectangle that determines the discrepancy of a set of points is typically a small region containing many points or a large region with few points. Figure 5.7 visualizes the discrepancy metric. Discrepancy approaches 1 when all points overlap and approaches 0 when all possible hyperrectangular subsets have their proper share of points. In general, discrepancy is difficult to compute exactly, especially in high dimensions.

Figure 5.7. Visualization of the discrepancy metric. The rectangles indicate two candidates for the worst-case rectangle used to define discrepancy. Discrepancy is determined by a rectangle with small area and many points (top) or a rectangle with large area and few points (bottom).

Star discrepancy is a special case of discrepancy that is easier to compute and is often used in practice. Instead of considering all possible hyperrectangular subsets, star discrepancy considers only hyperrectangular subsets of the unit hypercube that have a vertex at the origin. We can always normalize any hyperrectangular space 𝒮 to the unit hypercube by dividing by the side length in each dimension. Given these constraints, it is possible to compute lower and upper bounds on star discrepancy.⁸ We first partition the unit hypercube ℬ into a finite number of

⁸ E. Thiémard, "An Algorithm to Compute Bounds for the Star Discrepancy," Journal of Complexity, vol. 17, no. 4, pp. 850–880, 2001. Examples of other approximations can be found in Y.-D. Zhou, K.-T. Fang, and J.-H. Ning, "Mixture Discrepancy for Quasi-Random Point Sets," Journal of Complexity, vol. 29, no. 3–4, pp. 283–301, 2013.
[Plot: star discrepancy versus grid resolution.]

We can use these coverage metrics to select the next goal state in RRT. In particular, we want to select the goal state that would result in the greatest increase in coverage if added to the current tree. While it is difficult to determine the goal state exactly, we can approximate this process by drawing samples from the state space, computing the difference in coverage for each sample, and selecting the sample with the largest increase.⁹ The samples may be selected from a grid (figure 5.10) or drawn uniformly from the state space
Algorithm 5.8. Algorithm for computing upper and lower bounds on the star discrepancy of a set of points on a space bounded by lo and hi. It uses a partition specified by lengths, which contains the number of subrectangles in each dimension. The algorithm first normalizes the points to lie in the unit hypercube. It then creates the partition over the unit hypercube and computes the upper and lower bounds on star discrepancy using equation (5.6). We use the LazySets.jl package to represent hyperrectangles.

function star_discrepancy(points, lo, hi, lengths)
    n, dim = length(points), length(lo)
    𝒱 = [(point .- lo) ./ (hi .- lo) for point in points]
    ranges = [range(0, 1, length)[1:end-1] for length in lengths]
    steps = [Float64(r.step) for r in ranges]
    ℬ = Hyperrectangle(low=zeros(dim), high=ones(dim))
    lbs, ubs = [], []
    for grid_point in Iterators.product(ranges...)
        h⁻ = Hyperrectangle(low=zeros(dim), high=[grid_point...])
        h⁺ = Hyperrectangle(low=zeros(dim), high=grid_point .+ steps)
        𝒱h⁻ = length(filter(v -> v ∈ h⁻, 𝒱))
        𝒱h⁺ = length(filter(v -> v ∈ h⁺, 𝒱))
        push!(lbs, max(abs(𝒱h⁻ / n - volume(h⁻) / volume(ℬ)),
                       abs(𝒱h⁺ / n - volume(h⁺) / volume(ℬ))))
        push!(ubs, max(𝒱h⁺ / n - volume(h⁻) / volume(ℬ),
                       volume(h⁺) / volume(ℬ) - 𝒱h⁻ / n))
    end
    return maximum(lbs), maximum(ubs)
end
Figure 5.10. Selecting the next goal state for RRT applied to the continuum world problem using the average dispersion and star discrepancy coverage metrics. The plots show the grid points used as candidates for the next goal state. The color of each grid point indicates the increase in coverage that would result from adding that grid point to the tree, with darker colors indicating a greater increase. For star discrepancy, the colors represent the lower bound. The star indicates the goal state selected by RRT. Because star discrepancy only focuses on the worst-case hyperrectangle, it is not as smooth as average dispersion. (Panels: Average Dispersion, Star Discrepancy.)
Example 5.3. RRT applied to the continuum world problem using coverage heuristics. The plots illustrate the effect of selecting the next goal state based on coverage rather than randomly selecting it.

Suppose we want to apply coverage heuristics when using RRT on the continuum world problem. The following code implements a version of the select_goal function that uses coverage based on average dispersion to guide the search.

function select_goal(tree; m=5)
    a, b, lengths = [0, 0], [10, 10], [10, 10]
    points = [node.state for node in tree]
    sgoals = [rand.(Distributions.Uniform.(a, b)) for _ in 1:m]
    dispersions = [average_dispersion([points..., sgoal], a, b, lengths)
                   for sgoal in sgoals]
    coverages = 1 .- dispersions
    return sgoals[argmax(coverages)]
end

We first collect the states visited so far from the nodes of the tree and sample m potential goal states uniformly from the state space. We then compute the new average dispersion if each goal state were added to the tree. The goal state that results in the greatest increase in coverage is selected. The plots show the resulting trees when using random goals and coverage-based goals. Using coverage-based goal selection results in a wider tree that covers more of the state space.
We can terminate the tree search when the growth metric is sufficiently small. It is important to note that this growth metric does not provide any guarantees about the coverage of the search tree. Even when growth is small, the coverage metric may still converge to a number less than 1 (see figure 5.11). Some states are extremely unlikely to be reached under the nominal disturbance model.

Figure 5.11. The average dispersion coverage metric over iterations of RRT applied to the continuum world problem.

5.3.3 Alternative Objectives

As noted in the previous chapter, we may want to go beyond a simple search for failures and incorporate other objectives into the search process. For example, we may be interested in finding the shortest path to failure or the most likely failure. We can incorporate these objectives into RRT by modifying how we compute the objectives in the select step (algorithm 5.9).
First, we define a cost function c that maps a node to a cost of transitioning
to the node from its parent. For example, the cost might be a measure of the
distance between the node’s state and its parent’s state. To ensure that the tree
search algorithm is still encouraged to reach the goal, all costs must be positive.
The total cost of a path is the sum of the costs of all nodes in the path. Our goal is
to find the path to the goal with the lowest total cost.
We compute an objective for each node consisting of two components: the total cost of the current path from the root to the node and an estimate of the remaining cost to get from the node to the goal state. The remaining cost estimate comes from a heuristic function h. One potential heuristic is the distance from the current node to the goal state. Algorithm 5.9 implements this process given a cost function and heuristic function.¹⁰ It provides default cost and heuristic functions that will guide the search toward the shortest path.

¹⁰ This algorithm is a simplified version of the RRT* algorithm. S. Karaman and E. Frazzoli, "Incremental Sampling-Based Algorithms for Optimal Motion Planning," Robotics Science and Systems VI, vol. 104, no. 2, pp. 267–274, 2010.

To search for the most likely failure, we can use a cost function related to the negative log likelihood of the disturbance for the current node. We add a constant factor of the maximum possible log likelihood according to the disturbance distribution to ensure that the cost is positive. For the heuristic function, we need to estimate the log likelihood of the remaining path required to reach the goal state. One option is to use the distance to the goal state as a proxy for this value, since longer paths tend to have lower likelihoods. Adding a scaling
factor to the cost function to balance between the heuristic and the cost may improve performance. Figure 5.12 shows the results from using RRT to find the shortest path to failure and the most likely failure for the continuum world problem.

Figure 5.12. The nominal path for the continuum world problem compared to the shortest path to failure and the most likely failure path found by RRT. The most likely failure path stays closer to the nominal path before moving toward the obstacle.

While algorithm 5.9 will often find a low-cost path to failure, it is not necessarily guaranteed to find the path with the lowest possible cost. Certain conditions on the nature of the problem and the heuristic function are required to guarantee optimality. Algorithm 5.9 will converge to the optimal path if the state space and disturbance space are discrete and the heuristic function is admissible.¹¹ A heuristic is admissible if it is guaranteed to never overestimate the cost of reaching the goal state. In shortest path problems, the straight-line distance to the goal state is an admissible heuristic. Example 5.4 demonstrates this result on the grid world problem.

¹¹ When these conditions are met, the algorithm is the same as the A* search algorithm. P. E. Hart, N. J. Nilsson, and B. Raphael, "A Formal Basis for the Heuristic Determination of Minimum Cost Paths," IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.
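The cost and heuristic functions for the most likely failure search described above might be sketched as follows, where c_max is the maximum log likelihood of a single disturbance and λ scales the distance heuristic. The node fields and all names are illustrative assumptions.

using LinearAlgebra, Distributions

# Per-step cost: negative log likelihood of the node's disturbance,
# shifted by c_max so that every cost is positive.
step_cost(D, x; c_max) = c_max - logpdf(D, x)

# Total path cost: sum of step costs from the root to this node.
function path_cost(node, D; c_max)
    node.parent === nothing && return 0.0
    return step_cost(D, node.disturbance; c_max) + path_cost(node.parent, D; c_max)
end

# Node objective: path cost so far plus a scaled distance heuristic.
node_objective(node, D, sgoal; c_max, λ=1.0) =
    path_cost(node, D; c_max) + λ * norm(sgoal .- node.state)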
5.4 Monte Carlo Tree Search

Monte Carlo tree search (MCTS) (algorithm 5.10) is a tree search algorithm that balances between exploration and exploitation.¹² It explores by selecting nodes that have not been visited many times and exploits by biasing the search tree toward paths that seem most promising. MCTS determines which paths are most

¹² For a survey, see C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A Survey of Monte Carlo Tree Search Methods," IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.
Example 5.4. Example of using RRT to find the shortest path to failure and the most likely failure for the grid world problem. The plots show the search tree at different iterations of the algorithm. The most likely failure path stays closer to the nominal path (highlighted in gray) before moving toward the obstacle.

Since the state space and disturbance space for the grid world problem are discrete, we are guaranteed to find the shortest path to failure and the most likely failure path as long as we select an admissible heuristic function. For the shortest path to failure, an admissible heuristic is the Euclidean distance between the current state and the goal state. This distance will always be less than or equal to the actual cost of reaching the goal state since the shortest path between two points is a straight line. For the most likely failure path, we can use the likelihood of a straight-line trajectory from the current state to the goal state, assuming that it uses the most likely disturbance at each step. The plots show the results. As in the continuum world problem (figure 5.12), the most likely failure path stays closer to the nominal path before moving toward the obstacle.

[Plots: search trees for the shortest path and the most likely path at iteration 10, iteration 25, and convergence.]
promising by maintaining a value function Q(s, x) for each node in the tree. Given a failure objective (section 4.5), Q(s, x) represents the expected future objective value when applying disturbance x from state s. MCTS searches for the path with the lowest objective value.¹³

¹³ When the objective function is the most likely failure objective, this technique is sometimes referred to as adaptive stress testing. R. Lee, O. J. Mengshoel, A. Saksena, R. W. Gardner, D. Genin, J. Silbermann, M. Owen, and M. J. Kochenderfer, "Adaptive Stress Testing: Finding Likely Failure Events with Reinforcement Learning," Journal of Artificial Intelligence Research, vol. 69, pp. 1165–1201, 2020.

In the select step, MCTS traverses the tree starting at the root node. At each node, we determine whether to select it for the extend step based on its current number of children and number of visits N(s). Specifically, we extend the node if the number of children is less than or equal to kN(s)^α, where k and α are algorithm hyperparameters. This process is referred to as progressive widening. If the number of children exceeds this value, we continue to traverse the tree using a heuristic that balances between exploration and exploitation.
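The progressive widening test itself is a one-liner; the sketch below checks whether a node with a given visit count should be extended, with illustrative defaults for k and α.

# Progressive widening: extend a node only while its number of children
# is at most k * N(s)^α, where N(s) is the node's visit count.
should_extend(n_children, n_visits; k=1.0, α=0.5) = n_children ≤ k * n_visits^α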
Figure 5.13. MCTS applied to find a failure in the continuum world problem. Darker nodes and edges were visited more often. MCTS finds a failure (highlighted in red) after 258 iterations. (Panels: Iteration 100, Iteration 200, Failure Found.)
A common heuristic is the lower confidence bound (LCB) (algorithm 5.11), which is defined as

    Q(s, x) − c √(log N(s) / N(s, x))        (5.8)

where N(s, x) is the number of times we took the path corresponding to disturbance x from the node corresponding to state s. The first term in equation (5.8) exploits our current estimate of how promising a particular path is based on the value function, and the second term is an exploration bonus. The exploration constant c controls the amount of exploration. Higher values will lead to more exploration. We move to the child node with the lowest LCB value and repeat the process until we reach a node that we can extend.
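A sketch of equation (5.8) in code is shown below; Q is the value estimate for taking disturbance x from state s, and Ns and Nsx are the corresponding visit counts. The argument names are illustrative.

# Lower confidence bound of equation (5.8): exploit the value estimate
# and subtract an exploration bonus that shrinks with the visit count.
lcb(Q, Ns, Nsx; c=1.0) = Q - c * sqrt(log(Ns) / Nsx)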
In the extend step, MCTS samples a disturbance and simulates the system one
step forward in time from the current node. It then estimates the value at the
new node and adds it to the tree. A common technique to estimate this value is
to perform rollouts from the new node and evaluate their robustness. We can
also estimate the value using a heuristic such as distance to failure. Finally, we
propagate this information back up the tree to update the visit counts and mean
value estimate for each node in the path. Figure 5.13 shows the result of using
MCTS to find failures in the continuum world problem. The algorithm gradually
expands the tree toward the obstacle and visits promising nodes more often.
The tree search algorithms we have presented so far assumed deterministic transitions between nodes. In other words, simulating disturbance x from state s will always lead to the same next state s′. However, we may not have control over all sources of randomness for some real-world simulators, resulting in stochastic transitions between nodes. One advantage of MCTS is that it can handle this stochasticity. A technique called double progressive widening can be used to extend the tree in these cases. Double progressive widening applies the progressive widening condition to both the disturbance and the next state.¹⁴

¹⁴ A. Couëtoux, J.-B. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard, "Continuous Upper Confidence Trees," in Learning and Intelligent Optimization (LION), 2011.

5.5 Reinforcement Learning
Reinforcement learning algorithms train agents to perform a task while they interact with an environment.¹⁵ We can use reinforcement learning for falsification by training an agent to cause a system to fail. To avoid confusing the reinforcement learning agent with the agent in the system under test, we call the reinforcement learning agent an adversary.

¹⁵ For an introduction to reinforcement learning, see R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Second Edition. MIT Press, 2018.

Figure 5.14 shows the overall setup. At each time step, the adversary interacts with the system by selecting a disturbance x. The system then steps forward in time and produces a reward r for the adversary related to the failure objective. We refer to a series of these time steps as an episode. Reinforcement learning
algorithms train the adversary to maximize reward using data gathered over a series of episodes. Specifically, the adversary learns a policy π_adv(s) that maps states to disturbances. Once the adversary is trained, we can use it to search for failures by performing rollouts of the system using disturbances selected by the adversary's policy.

Figure 5.14. Reinforcement learning for falsification. We train an adversary to select disturbances that will cause a system to fail. The adversary receives feedback in the form of a reward signal.

Similar to MCTS, reinforcement learning algorithms balance between exploration and exploitation. The adversary explores by trying different disturbances in each state, and exploits by selecting disturbances that are likely to lead to a failure. Typically, the adversary will explore more at the beginning of training to gather data that it can later exploit. Reinforcement learning algorithms balance between these two objectives to maximize sample efficiency. Sample-efficient algorithms require as few episodes as possible to learn an effective policy. A number of sample-efficient reinforcement learning algorithms have been developed, and we can use off-the-shelf implementations of them to efficiently find failures of complex systems.¹⁶

¹⁶ Off-the-shelf reinforcement learning packages provide implementations of a variety of reinforcement learning algorithms. For example, see the Crux.jl package in the Julia ecosystem.

Another advantage of a reinforcement learning approach is its ability to generalize. The shooting methods and tree search algorithms discussed in this chapter all required a specific initial state from which to find a failure path. Using reinforcement learning to find failures removes this necessity. Because the adversary learns a policy over the entire state space, we can perform a rollout from any initial state to search for a failure. Example 5.5 demonstrates this result on the continuum world problem using an off-the-shelf reinforcement learning package.
Example 5.5. Example of using reinforcement learning to find failures in the continuum world problem. The plots show rollouts of the adversary policy starting from different initial states after different numbers of training episodes. Failure trajectories are highlighted in red. The adversary is able to find failures from most initial states after 50,000 training episodes. For more information on the solving code, see the Crux.jl documentation.

To apply the reinforcement learning algorithms implemented in the Crux.jl package to the continuum world problem, we need to define the following:

initial_state_dist = Product([Distributions.Uniform(0, 10),
                              Distributions.Uniform(0, 10)])

function interact(s, x, rng)
    _, _, s′ = step(cw, s, Disturbance(0, x, 0))
    r = Float32(robustness(s, ψ.formula) - robustness(s′, ψ.formula))
    if norm(s′ - [4.5, 4.5]) < 0.5
        r += 10f0  # bonus reward for reaching the failure region
    end
    return (sp=s′, r=r)
end

We first define an initial state distribution that covers the entire state space, allowing us to find failures starting from any state. The interact function defines how the adversary interacts with the system. Given a state s and a disturbance x, the function simulates the system one step forward in time and returns a tuple with the next state s′ and reward r. The random number generator rng is a required input for Crux.jl but is not used in this case since the function is deterministic.

The reward is based on the change in robustness for the current step. We also add a large reward for reaching a failure state. With these definitions, we can apply any of the reinforcement learning algorithms in the Crux.jl package to find failures. The plots show rollouts of the adversary policy starting from different initial states after different numbers of training episodes using an algorithm called Proximal Policy Optimization (PPO). The adversary is able to find failures from most initial states after 50,000 training episodes.
5.6 Simulator Requirements

[Figure: the extend step as used by tree search and multiple shooting, built on a simulator step that maps a state and disturbance (s, x) to (s′, c).]
5.7 Summary
• Planning algorithms account for the temporal aspect of the falsification problem
and break it into a series of smaller problems.
• Tree search algorithms search the space of possible trajectories as a tree and
iteratively grow the tree in search of a failure trajectory.
6 Failure Distribution
While the falsification algorithms in the previous chapters search for single failure events, it is often desirable to understand the distribution over failures for a given system and specification. This distribution is difficult to quantify exactly for many real-world systems. Instead, we can approximate the failure distribution by drawing samples from it. This chapter discusses methods for sampling from the failure distribution. We present two categories of sampling methods. First, we discuss rejection sampling, which produces samples from a target distribution by accepting or rejecting samples from a different distribution. We then present Markov chain Monte Carlo (MCMC) methods. MCMC methods generate samples from a target distribution using a chain of correlated samples. We conclude with a discussion of probabilistic programming, which allows us to scale MCMC methods to complex, high-dimensional systems.
The distribution over failures for a given system with specification ψ is represented by the conditional probability p(τ | τ ∉ ψ). We can write this probability as

    p(τ | τ ∉ ψ) = 1{τ ∉ ψ} p(τ) / ∫ 1{τ ∉ ψ} p(τ) dτ        (6.1)

where 1{·} is the indicator function and p(τ) is the probability density of the nominal trajectory distribution for trajectory τ. Figure 6.1 shows the failure distribution for a simple system where trajectories consist of only a single state that is sampled from a normal distribution.

Figure 6.1. The distribution over failures for a simple system where trajectories consist of only a single state that is sampled from a normal distribution (black). A failure occurs when the sampled state is less than −1. The area of the shaded region corresponds to the integral in equation (6.1). The failure distribution (red) is the probability density function of the nominal distribution in the failure region scaled by this value.

For most systems, the failure distribution is difficult to compute exactly because doing so requires solving the integral in the denominator of equation (6.1) to compute the normalizing constant. The value of
this integral corresponds to the probability of failure for the system. We discuss methods to estimate this quantity in chapter 7.

While we cannot compute the probability density of the failure distribution exactly, we can use its unnormalized probability density p̄(τ | τ ∉ ψ) to draw samples from it. The unnormalized probability density is given by

    p̄(τ | τ ∉ ψ) = 1{τ ∉ ψ} p(τ)        (6.2)

Computing this density for a given trajectory only requires determining whether it is a failure trajectory and evaluating its probability density under the nominal trajectory distribution. The rest of this chapter discusses several methods for sampling from this unnormalized distribution.¹ With enough samples, we can implicitly represent the distribution over failures (see figure 6.2).

¹ For a detailed overview, see C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 1999, vol. 2.
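Equation (6.2) translates directly into code. The sketch below assumes an isfailure check for the specification and a pdf function for the nominal trajectory distribution, following the naming conventions of the earlier chapters; both are assumptions here.

# The unnormalized failure density of equation (6.2): the nominal density
# if τ is a failure trajectory, and zero otherwise.
unnormalized_failure_density(p, ψ, τ) = isfailure(ψ, τ) ? pdf(p, τ) : 0.0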
Figure 6.2. Distribution over failures for the grid world problem represented implicitly through samples. The probability of slipping is set to 0.8.

6.2 Rejection Sampling
Rejection sampling produces samples from a complex target distribution by accepting or rejecting samples from a different distribution that is easier to sample from. It is inspired by the idea of throwing darts uniformly at a rectangular dart board that encloses the graph of the density of the target distribution. If we keep only the darts that land inside the target density, we produce samples that are distributed according to the target distribution (see figure 6.3).

Figure 6.3. Sampling from a truncated normal distribution by throwing darts uniformly at a rectangular dart board that encloses the graph of its density function. The samples on the bottom are obtained by moving all of the darts that land inside the target distribution to the bottom of the dart board. These samples are distributed according to the target distribution.

In the dart board example, we are using samples from a uniform distribution to produce samples from an arbitrary target density. The efficiency of this process depends on the area of the dart board that lies outside the target distribution. If there is a large area outside the target distribution, many of the darts will be rejected, and we will require more darts to accurately represent the target distribution. One way to improve efficiency is to use a different dart board that more closely matches the shape of the target distribution. In other words, we may want to draw samples from a different distribution that is still easy to sample from but more closely matches the target distribution. We call this distribution a proposal distribution.

Algorithm 6.1 implements the rejection sampling algorithm given a target distribution with density function p̄(τ) and a proposal distribution with density function q(τ). At each iteration, we draw a sample τ from the proposal distribution and accept it with probability p̄(τ)/(cq(τ)).² To ensure that the proposal distribution fully encloses the target distribution, we require that

² In the dart board analogy, we can think of this acceptance criterion as a two-step process. First, we sample the x-coordinate of the dart from the proposal distribution. Second, we select its y-coordinate randomly between the bottom of the board and cq(τ). If it falls under p̄(τ), it is accepted.
q(τ) > 0 whenever p̄(τ) > 0 and that c is selected such that p̄(τ) ≤ cq(τ) for all τ. The density function of the target distribution does not need to be normalized.
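A minimal sketch of the procedure, standing in for the book's algorithm 6.1, is shown below; the proposal sampler and density functions are assumptions.

# Rejection sampling: draw τ from the proposal q and accept it with
# probability p̄(τ) / (c q(τ)), which requires p̄(τ) ≤ c q(τ) for all τ.
function rejection_sample(p̄, q_sample, q_pdf, c; m=1000)
    samples = []
    for i in 1:m
        τ = q_sample()
        if rand() < p̄(τ) / (c * q_pdf(τ))
            push!(samples, τ)
        end
    end
    return samples
end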
To sample from the failure distribution, we use the unnormalized density in equation (6.2) as the target density. A common choice for the proposal distribution is the nominal trajectory distribution. To use this proposal, we must select a value for c such that 1{τ ∉ ψ} p(τ) ≤ cp(τ). Selecting c = 1 satisfies this condition and causes the acceptance ratio to reduce to 1{τ ∉ ψ}. In other words, we will accept a sample if it is a failure trajectory and reject it otherwise. Figure 6.4 shows an example that uses the nominal trajectory distribution to sample from the failure distribution shown in figure 6.1.

Figure 6.4. Rejection sampling using the nominal trajectory distribution as the proposal distribution to sample from the failure distribution shown in figure 6.1. The plot on the top shows the target density (red) and the proposal density (gray). Accepted samples are highlighted in red. The plot on the bottom shows a histogram of the accepted samples compared to the density function of the failure distribution.

If failures are unlikely under the nominal distribution, we will require many samples to produce a representative set of samples from the failure distribution. In this case, we may be able to improve efficiency by using domain knowledge to select a proposal distribution that more closely matches the shape of the failure distribution. For example, failures occur at negative values in the simple system shown in figure 6.1, so we may be able to improve efficiency by shifting the proposal distribution to the left.

When we select the proposal distribution for rejection sampling, we must also select a value for c to ensure that the proposal distribution fully encloses the target distribution for all τ. Figure 6.5 shows an example that uses a shifted proposal distribution to sample from the failure distribution shown in figure 6.1 for two values of c.
Figure 6.5. Rejection sampling using a shifted proposal distribution to sample from the failure distribution shown in figure 6.1. The top row shows the results for c = 1, which is a loose bound. The bottom row shows the results for c = 0.6065, the tightest possible value. More samples are accepted using the tighter value of c, resulting in greater efficiency. (The plots show the nominal density p(τ), the scaled proposal c q(τ), the target density p̄(τ | τ ∉ ψ), and histograms of the accepted samples.)
Example 6.1. Example of the challenges of using rejection sampling for high-dimensional systems with long time horizons. In this example, we compute the tightest value we can select for c based on domain knowledge for the inverted pendulum system and show that it is prohibitively large.

Suppose we want to use rejection sampling to sample failures from an inverted pendulum system where the standard deviation of the sensor noise for each state variable is 0.1. From example 4.3, we know that failures are rare under the nominal trajectory distribution, so rejection sampling using the nominal trajectory distribution as a proposal will be inefficient. We also saw in example 4.3 that when we instead sampled trajectories from a distribution where the standard deviation of the sensor noise was 0.15, we were able to find failures. Therefore, we may want to use this distribution as a proposal for rejection sampling.

We must then select a value for c such that

$$p(\tau) \le c\, q(\tau)$$
$$p(s_1) \prod_{t=1}^{d} \mathcal{N}(x_t \mid 0, (0.1)^2 I) \le c\, p(s_1) \prod_{t=1}^{d} \mathcal{N}(x_t \mid 0, (0.15)^2 I)$$
$$\prod_{t=1}^{d} \left( \frac{\mathcal{N}(x_t \mid 0, 0.01 I)}{\mathcal{N}(x_t \mid 0, 0.0225 I)} \right) \le c$$

where we assume that the initial state distribution is the same for the proposal and target. The term in the product is maximized when x_t = [0, 0] for all t. Plugging this result into the product and assuming a depth of 40, we find that

$$\left( \frac{\mathcal{N}(0 \mid 0, 0.01 I)}{\mathcal{N}(0 \mid 0, 0.0225 I)} \right)^{40} \le c$$
$$1.2226 \times 10^{14} \le c$$

Therefore, the tightest value we can select for c is 1.2226 × 10¹⁴. Using this value, our acceptance probabilities end up being very small (on the order of 10⁻¹⁵), and rejection sampling is inefficient.
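We can check this bound numerically with Distributions.jl; this is a quick verification sketch of the arithmetic in example 6.1, not part of the book's code:

using Distributions, LinearAlgebra

# The per-step density ratio is largest at xₜ = [0, 0]; compounding that
# worst-case ratio over a depth of 40 gives the tightest valid c.
p0 = pdf(MvNormal(zeros(2), 0.1^2 * Matrix(I, 2, 2)), zeros(2))
q0 = pdf(MvNormal(zeros(2), 0.15^2 * Matrix(I, 2, 2)), zeros(2))
c = (p0 / q0)^40    # ≈ 1.2226 × 10¹⁴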
Markov chain Monte Carlo (MCMC) algorithms generate samples from a target distribution by sampling from a Markov chain.³ A Markov chain is a sequence of random variables in which each variable depends only on the previous one. MCMC algorithms begin by initializing a Markov chain with an initial sample τ. At each iteration, they use the current sample τ to generate a new sample τ′ by sampling from a kernel g(· | τ).

³ A detailed overview of MCMC techniques is provided in C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 1999, vol. 2.
6.3.1 Metropolis-Hastings
One of the most common MCMC algorithms is the Metropolis-Hastings algorithm.⁵ The Metropolis-Hastings algorithm accepts a new sample τ′ given the current sample τ with probability

$$\frac{\bar{p}(\tau')\, g(\tau \mid \tau')}{\bar{p}(\tau)\, g(\tau' \mid \tau)} \tag{6.3}$$

where p̄ is the unnormalized target density. To sample from the failure distribution, we set p̄(τ) = 1{τ ∉ ψ}p(τ). Since we are taking a ratio of the densities, the target density does not need to be normalized. The kernel g(· | τ) is often chosen to be a symmetric distribution, meaning that g(τ′ | τ) = g(τ | τ′).⁶ In this case, the acceptance criterion reduces to p̄(τ′)/p̄(τ). Intuitively, if τ′ is more likely than τ, it is always accepted. If τ′ is less likely than τ, it is accepted with probability equal to the ratio of the densities.

⁵ W. K. Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, vol. 57, no. 1, pp. 97–109, 1970.

⁶ When the kernel is symmetric, the algorithm is called the Metropolis algorithm: N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics, vol. 21, no. 6, pp. 1087–1092, 1953. A common choice of a symmetric kernel is a Gaussian distribution centered at the previous sample.
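The following is a minimal sketch of Metropolis-Hastings for a scalar quantity with a symmetric Gaussian kernel, in which case the acceptance test uses p̄(τ′)/p̄(τ); it assumes Distributions.jl and an unnormalized target p̄, and it is illustrative rather than the book's algorithm 6.2:

using Distributions

# Metropolis sampler with a symmetric Gaussian kernel of width σ. The chain
# should be initialized at a point where p̄(τ0) > 0 (for failure sampling,
# a known failure trajectory).
function metropolis(p̄, τ0, k_max; σ=1.0)
    τ, samples = τ0, Float64[]
    for k in 1:k_max
        τ′ = rand(Normal(τ, σ))        # propose from the symmetric kernel
        if rand() ≤ p̄(τ′) / p̄(τ)       # accept with probability min(1, ratio)
            τ = τ′
        end
        push!(samples, τ)
    end
    return samples
end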
6.3.2 Smoothing

When we use algorithm 6.2 to sample from the failure distribution, we will not accept any samples that are not failures because p̄(τ) = 1{τ ∉ ψ}p(τ) will be 0 for those samples. While this behavior is necessary for the algorithm to converge to the failure distribution in the limit of infinite samples, it can create challenges in practice. For example, if we initialize the Markov chain to a safe trajectory, the algorithm will reject all samples from g(· | τ) until it samples a failure. Since g(· | τ) typically produces trajectories similar to τ, we may require many samples before we sample a failure to accept, especially if τ is far from the failure region.⁷

⁷ One way to avoid this behavior is to ensure that the initial trajectory is a failure. The algorithms in chapters 4 and 5 can be used to search for an initial failure trajectory.
Example 6.2. Example of a Gaussian kernel for the inverted pendulum system.

To define a Gaussian kernel for the inverted pendulum system, we must first define a trajectory distribution type (algorithm 4.3) for the pendulum. The following code defines a trajectory distribution for the pendulum system that uses a Gaussian distribution for the initial state and a vector of Gaussian distributions for the observation disturbance distributions:

struct PendulumTrajectoryDistribution <: TrajectoryDistribution
    μ₁ # mean of initial state distribution
    Σ₁ # covariance of initial state distribution
    μs # vector of means of length d
    Σs # vector of covariances of length d
end

function initial_state_distribution(p::PendulumTrajectoryDistribution)
    return MvNormal(p.μ₁, p.Σ₁)
end

function disturbance_distribution(p::PendulumTrajectoryDistribution, t)
    D = DisturbanceDistribution((o)->Deterministic(),
                                (s,a)->Deterministic(),
                                (s)->MvNormal(p.μs[t], p.Σs[t]))
    return D
end

depth(p::PendulumTrajectoryDistribution) = length(p.μs)

We can then define a kernel for the pendulum system that returns an instantiation of this distribution as follows:

function inverted_pendulum_kernel(τ; Σ=0.01I)
    μ₁ = τ[1].s                      # center the initial state at the current sample
    μs = [step.x.xo for step in τ]   # center disturbances at the current observations
    return PendulumTrajectoryDistribution(μ₁, Σ, μs, [Σ for step in τ])
end

The new distribution is centered at the initial state and observation disturbances of the current sample. We can use this kernel with algorithm 6.2 to sample from the failure distribution of the inverted pendulum system.
We see this behavior during the burn-in period in figure 6.6. Another challenge arises when the failure distribution has multiple modes. To move between modes, the algorithm must sample a failure from one failure mode using a kernel conditioned on a trajectory from another. If the failure modes are spread out in the trajectory space, the algorithm may require a large number of samples before moving from one mode to another. Example 6.3 illustrates these challenges on a simple Gaussian system.
Smoothing is a technique that addresses these challenges by modifying the target density to make it easier to sample from.⁸ It relies on a notion of the distance to failure, which we will write as D(τ) for a given trajectory τ. This distance is a nonnegative number that measures how close τ is to a failure. For failure trajectories, D(τ) should be 0. We can rewrite the target density in terms of this distance as

$$\bar{p}(\tau \mid \tau \not\in \psi) = \mathbb{1}\{D(\tau) \le 0\}\, p(\tau) \tag{6.4}$$

The indicator function causes sharp boundaries between safe and unsafe trajectories. To create a smooth version of this density, we replace the indicator function with a Gaussian density with mean 0 and a small standard deviation. The resulting smoothed density is

$$\bar{p}(\tau \mid \tau \not\in \psi) \approx \mathcal{N}\!\left(D(\tau) \mid 0, \epsilon^2\right) p(\tau) \tag{6.5}$$

where ε is the standard deviation.

For systems with temporal logic specifications, we can specify the distance function using temporal logic robustness (section 3.5.2). Since robustness is positive when the formula is satisfied and negative when it is violated, we can write the distance function as

$$D(\tau) = \max(0, \rho(\tau)) \tag{6.6}$$

where ρ(τ) is the robustness of the trajectory τ. Figure 6.7 shows smoothed versions of the failure distribution in figure 6.1 for different values of ε. As ε approaches 0, the smoothed density approaches the shape of the failure distribution. As ε approaches infinity, the smoothed density approaches the shape of the nominal distribution.

⁸ H. Delecki, A. Corso, and M. J. Kochenderfer, "Model-Based Validation as Probabilistic Inference," in Conference on Learning for Dynamics and Control (L4DC), 2023.

Figure 6.7. Smoothed versions of the failure distribution in figure 6.1 for ε ∈ {0.8, 0.5, 0.2} and with no smoothing. As ε decreases, the smoothed distribution approaches the failure distribution.
Example 6.3. Example of the challenges of using MCMC to sample from the failure distribution given a finite sample budget. The plot on the left demonstrates the challenges with initialization, and the plot on the right shows the challenges of sampling from failure distributions with multiple modes.

Suppose we want to sample from the failure distribution shown in the plot on the left and we initialize our Markov chain with τ = 1. We will not accept a new sample until we draw a sample with a value less than −1. If we use a Gaussian kernel with standard deviation 1, we have

$$g(\tau' \mid \tau) = \mathcal{N}(\tau' \mid 1, 1^2)$$

The probability of drawing a sample less than −1 from this distribution is 0.02275 (corresponding to the shaded region in the plot on the left). Therefore, we will require 1/0.02275 ≈ 44 samples on average before the algorithm accepts a sample. If we were to initialize the algorithm with a sample even further from the failure region, we would require even more samples, to the point where MCMC may not converge within a finite sample budget.

The plot on the right demonstrates the challenge of using MCMC to sample from a failure distribution with multiple modes. In this case, the current sample is in the mode on the left at −2.2. Using the same Gaussian kernel, we have

$$g(\tau' \mid \tau) = \mathcal{N}(\tau' \mid -2.2, 1^2)$$

The probability of moving to the other mode from this point is 1.3346 × 10⁻⁵. Therefore, we will require a large number of samples before we switch modes.
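These probabilities can be verified directly with Distributions.jl; a quick check, assuming the failure regions τ < −1 and τ > 2 shown in the example's plots:

using Distributions

cdf(Normal(1, 1), -1)       # 0.02275: chance a proposal lands in the failure region
1 / cdf(Normal(1, 1), -1)   # ≈ 44 proposals needed on average before acceptance
ccdf(Normal(-2.2, 1), 2)    # ≈ 1.3346e-5: chance of proposing a jump to the other mode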
Gradient information from the target density can also be used to guide the algorithm toward different failure modes; this is the idea behind the Metropolis-adjusted Langevin algorithm (MALA). Figure 6.9 compares the path taken by a Gaussian kernel with the path taken by MALA on a simple target density. The MALA kernel enables MCMC to move more efficiently toward regions of high likelihood.
Example 6.4. Applying smoothing to sample from the failure distribution of the inverted pendulum system. Smoothing allows MCMC to sample from both failure modes given a finite sample budget. The plot below shows the result of running MCMC without smoothing.

The inverted pendulum system has two main failure modes: tipping over in the negative direction and tipping over in the positive direction. To observe the effect of smoothing on the performance of MCMC, we define the following two unnormalized target densities:

p = NominalTrajectoryDistribution(inverted_pendulum, 21) # depth = 21
p̄(τ) = isfailure(ψ, τ) * pdf(p, τ)
function p̄_smooth(τ; ϵ=0.15)
    Δ = max(robustness([step.s for step in τ], ψ.formula), 0)
    return pdf(Normal(0, ϵ), Δ) * pdf(p, τ)
end

The plot in the margin shows the results when we run algorithm 6.2 using p̄ as the target density. We will not accept any samples that are not failures, and we only observe failures from one failure mode. The plots below show the results when we use p̄_smooth combined with rejection sampling. Smoothing allows us to sample failures from both failure modes. However, we now draw some samples that are not failures during the MCMC (left), so we must reject them after the algorithm has terminated to recover the failure distribution (right). (The plots show pendulum angle θ in radians over time.)
6.4 Probabilistic Programming
Algorithm 6.3 writes the rollout function as a probabilistic program that can be used to sample from the smoothed failure distribution.¹⁶ Similar to algorithm 4.5, the probabilistic programming model samples an initial state from the initial state distribution and steps the system forward in time by sampling from the disturbance distribution at each time step. However, rather than explicitly drawing
the samples, the model only specifies the distributions from which the samples are drawn. The probabilistic programming tool handles the sampling and keeps track of the probability associated with each draw automatically.

To specify that we want to sample failure trajectories, we add a log probability term for the smoothed indicator function in equation (6.5). Probabilistic programming tools often perform operations in log space for numerical stability. Adding this term in log space is equivalent to multiplying the target density by the smoothed indicator function. Example 6.5 demonstrates how to use algorithm 6.3 to sample from the failure distribution of the inverted pendulum system. It runs the algorithm twice to produce two chains that capture two distinct failure modes. In addition to smoothing, running multiple MCMC chains from different starting points is another method to improve the performance of MCMC for failure distributions with multiple modes.

¹⁶ We use a probabilistic programming package written for the Julia language called Turing.jl.
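To illustrate the idea, here is a toy Turing.jl model that adds the smoothed log-density term with Turing.@addlogprob!; the random-walk dynamics and distance function below are placeholder assumptions, not the book's pendulum model from algorithm 6.3:

using Turing, Distributions, LinearAlgebra

# Toy probabilistic program whose posterior approximates a smoothed failure
# distribution: disturbances x drive a scalar random walk, and a trajectory
# is a failure when the final state falls to zero or below.
@model function smoothed_failure(d, ϵ)
    x ~ MvNormal(zeros(d), 0.1^2 * I)   # disturbance at each time step
    s = 1.0 + sum(x)                    # placeholder dynamics: final state
    Δ = max(s, 0.0)                     # placeholder distance to failure
    Turing.@addlogprob! logpdf(Normal(0, ϵ), Δ)   # smoothed indicator in log space
end

chain = sample(smoothed_failure(10, 0.2), NUTS(), 1_000)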
Example 6.5. Sampling from the failure distribution of the inverted pendulum system using algorithm 6.3. The plot shows the result of running the algorithm twice to produce two MCMC chains (labeled Chain 1 and Chain 2), plotted as pendulum angle θ in radians over time. The initial samples that are not failures are discarded during the burn-in period.

To use probabilistic programming to sample from the failure distribution of the inverted pendulum system, we can use the following code to set up the MCMC algorithm and distance function:

mcmc_alg = Turing.NUTS(10, 0.65, max_depth=6)
Δ(𝐬) = max(robustness(𝐬, ψ.formula, w=1.0), 0)

The code sets up the No-U-Turn Sampler (NUTS) MCMC algorithm. Since NUTS relies on the gradient of the target density, we use smoothed robustness in the distance function so that the gradient exists. The first two parameters in the NUTS constructor are the number of adaptation steps and the target acceptance rate. The plot shows the result of running algorithm 6.3 with the specified parameters. We run the algorithm twice to produce two MCMC chains. Running multiple chains from different starting points is another way to capture distinct failure modes.
6.5 Summary
• Markov chain Monte Carlo (MCMC) algorithms sample from the target dis-
tribution by drawing samples from a Markov chain and scale well to high-
dimensional systems.
7 Failure Probability Estimation
After searching for the potential failure modes of a system, we may also want to
estimate its probability of failure. This chapter presents several techniques for
estimating this quantity from samples. We begin by discussing a direct estimation
approach that uses samples from the nominal trajectory distribution to estimate
the probability of failure. If failures are rare, this approach may be inefficient and
require a large number of samples to produce a good estimate. The remainder of
the chapter discusses more efficient estimation techniques based on importance
sampling. Importance sampling techniques artificially increase the likelihood of
failure trajectories by sampling from a proposal distribution. We discuss several
variations of importance sampling and conclude by presenting a nonparametric
algorithm that estimates the probability of failure from a sequence of samples.
The probability of failure for a given system and specification is defined mathematically as

$$p_{\text{fail}} = \mathbb{E}_{\tau \sim p(\cdot)}[\mathbb{1}\{\tau \not\in \psi\}] = \int \mathbb{1}\{\tau \not\in \psi\}\, p(\tau)\, d\tau \tag{7.1}$$

where 1{·} is the indicator function. The expectation is taken over the nominal trajectory distribution for the system.¹ Given a set of m trajectories from this distribution, we can produce an estimate p̂_fail of the probability of failure by treating the problem as a parameter learning problem, where the parameter of interest is the parameter of a Bernoulli distribution. We can then apply the maximum likelihood or Bayesian methods from chapter 2 to calculate p̂_fail.

¹ Note that the right-hand side of equation (7.1) is equivalent to the denominator in equation (6.1). In other words, the probability of failure is the normalizing constant for the failure distribution.
The maximum likelihood estimate is

$$\hat{p}_{\text{fail}} = \frac{n}{m} \tag{7.2}$$

where n is the number of samples that resulted in a failure and m is the total number of samples. Algorithm 7.1 uses direct sampling to implement this estimator. It performs m rollouts and computes the probability of failure according to equation (7.2).
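A minimal sketch of this direct estimator, assuming placeholder functions rollout() that simulates one trajectory and isfailure(τ) that checks the specification (stand-ins for the book's simulation interface):

# Direct (Monte Carlo) estimation of the failure probability, equation (7.2).
function direct_estimate(rollout, isfailure, m)
    n = count(_ -> isfailure(rollout()), 1:m)   # number of observed failures
    return n / m
end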
We can evaluate the accuracy of an estimator using metrics such as bias, consis-
tency, and variance (example 7.1). Equation (7.2) provides an empirical estimate
of the probability of failure by computing the sample mean of a set of samples
drawn from a Bernoulli distribution with parameter pfail . The sample mean is an
unbiased estimator of the true mean of a Bernoulli distribution, so the estimator is
unbiased. We can calculate the variance of this estimator by dividing the variance
of a Bernoulli distribution by the number of samples:
$$\mathrm{Var}[\hat{p}_{\text{fail}}] = \frac{p_{\text{fail}}(1 - p_{\text{fail}})}{m} \tag{7.3}$$
The square root of this quantity is known as the standard error of the estimator. A
lower variance means that the sample mean will be closer to the true mean on
average and therefore indicates a more accurate estimator. In the limit of infinite
samples, the variance approaches zero, so the estimator is consistent.
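To make the scale of this error concrete: with $p_{\text{fail}} = 0.01$ and $m = 1000$ samples, the standard error is $\sqrt{0.01 \times 0.99 / 1000} \approx 0.0031$, roughly a third of the quantity being estimated.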
Example 7.1. Common metrics used to evaluate estimators. The plots show predictions of three different estimators with shaded regions to represent the variance.

Bias, consistency, and variance are three common properties used to evaluate the quality of an estimator. An estimator that produces p̂_fail is unbiased if it predicts the true value in expectation:

$$\mathbb{E}[\hat{p}_{\text{fail}}] = p_{\text{fail}}$$

An estimator is consistent if its estimate converges to the true value as the number of samples approaches infinity. For example, given a set of samples drawn independently from the same distribution, the sample mean is an unbiased and consistent estimator of the distribution's true mean. The variance of the estimator quantifies the spread of the estimates around the true value. For the sample mean example, the variance will decrease as the number of samples increases. The plots below illustrate these concepts; the shaded regions reflect the variance of the estimator.
Example 7.2. The empirical mean and variance of the direct estimator for the grid world problem computed over 10 trials of algorithm 7.1. The depth d is set to 50 and the probability of slipping is set to 0.8. The blue line shows the mean of p̂_fail for all 10 trials, and the shaded region represents one standard deviation above and below the mean.

We demonstrate the effect of equation (7.3) empirically by running 10 trials of algorithm 7.1 on the grid world problem. We compute the empirical mean and variance of p̂_fail across all 10 trials after each new sample. The plot below shows the results of this experiment, with p̂_fail (on the order of 10⁻²) plotted against the number of samples m from 0 to 5,000.
In addition to the number of samples, the true probability of failure pfail also
has an impact on the relative accuracy of the estimator. As the true probability
of failure decreases, the number of samples required to achieve a given level of
accuracy increases (see exercise 7.1). For systems in which failure events are rare,
we may require a large number of samples to produce an accurate estimate for
the probability of failure using algorithm 7.1. Section 7.2 introduces importance
sampling, which can be used to improve the efficiency in these scenarios.
We are interested in the quantity P(p_fail ≤ δ), which is the probability that the true probability of failure is less than or equal to δ. This quantity is given by the cumulative distribution function of the posterior distribution over the probability of failure.² The quantiles of the posterior distribution can be used to compute confidence intervals in a similar manner. Example 7.3 demonstrates this process.

² The cumulative distribution function of a Beta distribution is the regularized incomplete beta function. Software packages such as Distributions.jl provide implementations of both the cumulative distribution function and the quantile function for the Beta distribution.

7.2 Importance Sampling

Importance sampling algorithms increase the efficiency of sampling-based estimation techniques. Instead of sampling from the nominal trajectory distribution p, they sample from a proposal distribution q that assigns higher likelihood to areas of greater "importance."³ To estimate the probability of failure using these samples, we must transform the expectation in equation (7.1) to an expectation over q:
$$\begin{aligned} p_{\text{fail}} &= \mathbb{E}_{\tau \sim p(\cdot)}[\mathbb{1}\{\tau \not\in \psi\}] \\ &= \int p(\tau)\, \mathbb{1}\{\tau \not\in \psi\}\, d\tau \\ &= \int p(\tau) \frac{q(\tau)}{q(\tau)}\, \mathbb{1}\{\tau \not\in \psi\}\, d\tau \\ &= \int q(\tau) \frac{p(\tau)}{q(\tau)}\, \mathbb{1}\{\tau \not\in \psi\}\, d\tau \\ &= \mathbb{E}_{\tau \sim q(\cdot)}\!\left[\frac{p(\tau)}{q(\tau)}\, \mathbb{1}\{\tau \not\in \psi\}\right] \end{aligned} \tag{7.7}$$

³ This proposal distribution has similar properties to the proposal distribution introduced in section 6.2 for rejection sampling.
For equation (7.7) to be valid, we require that q(τ) > 0 wherever p(τ)1{τ ∉ ψ} > 0. This condition is satisfied as long as the proposal distribution assigns a nonzero likelihood to all failure trajectories that are possible under p.

Given samples from q(·), we can estimate the probability of failure based on equation (7.7) as

$$\hat{p}_{\text{fail}} = \frac{1}{m} \sum_{i=1}^{m} \frac{p(\tau_i)}{q(\tau_i)}\, \mathbb{1}\{\tau_i \not\in \psi\} \tag{7.8}$$

Algorithm 7.3 implements this estimator. Equation (7.8) is an unbiased estimator of the true probability of failure. It corresponds to a weighted average of samples from the proposal distribution:

$$\hat{p}_{\text{fail}} = \frac{1}{m} \sum_{i=1}^{m} w_i\, \mathbb{1}\{\tau_i \not\in \psi\} \tag{7.9}$$
Example 7.3. Quantifying uncertainty in the probability estimate produced by algorithm 7.2. The plots show the posterior distribution Beta(1, 101). The shaded region in the first plot represents the probability that the true probability of failure is less than or equal to 0.01 (about 0.64). The shaded region in the second plot shows the 95% confidence bound.

Suppose that we run algorithm 7.2 on the collision avoidance problem with m = 100 samples and observe no failures. Assuming we begin with a uniform prior, the posterior distribution over the probability of failure is Beta(1, 101). Suppose we are also given a safety requirement for the system stating that p_fail must not exceed 0.01. We can compute P(p_fail ≤ 0.01) from the cumulative distribution function of the beta distribution using the following code:

using Distributions
posterior = Beta(1, 101)
confidence = cdf(posterior, 0.01)
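The 95% bound shown in the second plot can similarly be read off the posterior quantile function provided by Distributions.jl; for this posterior, quantile(Beta(1, 101), 0.95) evaluates to approximately 0.029.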
where the weights are wᵢ = p(τᵢ)/q(τᵢ). These weights are sometimes referred to as importance weights. Trajectories that are more likely under the nominal trajectory distribution have higher importance weights.
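As a minimal sketch of equation (7.8) on a scalar toy problem, assuming Distributions.jl; the nominal density, failure region, and shifted proposal below are illustrative choices:

using Distributions

p, q = Normal(0, 1), Normal(-2, 1)    # nominal density and shifted proposal
isfail(τ) = τ < -2                    # illustrative failure region

m = 10_000
τs = rand(q, m)                       # sample from the proposal
ws = pdf.(p, τs) ./ pdf.(q, τs)       # importance weights wᵢ = p(τᵢ)/q(τᵢ)
p̂fail = sum(ws .* isfail.(τs)) / m    # equation (7.8)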
The proposal distribution that minimizes the variance of this estimator is the failure distribution itself:

$$q^*(\tau) = \frac{p(\tau)\, \mathbb{1}\{\tau \not\in \psi\}}{p_{\text{fail}}} \tag{7.11}$$
Example 7.4. Performance comparison of two hand-designed proposal distributions for the simple Gaussian problem where failures occur at values less than −2 (red region). The first plot shows the nominal distribution and two possible proposal distributions. The second plot shows the estimation error for direct estimation compared to the estimation error of importance sampling for the two distributions.

Consider the simple Gaussian problem shown below, where failures occur at values less than −2 (red shaded region). The plot on the left shows two proposal distributions we could use for importance sampling. The first proposal distribution q₁ shifts the nominal distribution toward the failure region and assigns high likelihood to likely failure trajectories. The second proposal distribution q₂ is shifted toward the failure region, but it still does not assign high likelihood to likely failures. Therefore, we expect q₁ to result in better estimates than q₂.

The plot on the right shows the estimation error when performing importance sampling with each proposal distribution compared to direct estimation. The shaded region represents the 90% empirical confidence bounds on the error. As expected, q₁ results in a lower estimation error and a lower variance than q₂ and direct estimation. Performing importance sampling with q₂ results in worse performance than direct estimation.
With a set of m proposal distributions q₁, …, q_m and one sample τᵢ drawn from each, one option is to set wᵢ = p(τᵢ)/qᵢ(τᵢ), so that the weight for each sample is computed using only the proposal that was used to generate it. This weighting scheme, which we refer to as standard MIS (s-MIS), is most similar to the importance sampling estimator for a single proposal distribution (equation (7.8)).

Instead of considering each proposal individually, we can also view the samples as if they were drawn in a deterministic order from a mixture distribution composed of all proposal distributions. This paradigm leads to the deterministic mixture weighting scheme (DM-MIS):

$$w_i = \frac{p(\tau_i)}{\frac{1}{m} \sum_{j=1}^{m} q_j(\tau_i)} \tag{7.14}$$
Example 7.5. Performance comparison of importance sampling to multiple importance sampling for a two-dimensional Gaussian problem. The plot on the left shows a single proposal distribution that can be used for importance sampling on this problem, while the plot on the right shows a set of proposal distributions that can be used for MIS. A third plot shows the estimation error for IS compared to the estimation error of MIS with the two different weighting schemes.

Suppose we want to estimate the probability of failure for the two-dimensional Gaussian system shown below. The nominal distribution is a multivariate Gaussian distribution with a mean at the center of the figure, and the failure region is composed of the two shaded red regions. The plots below show the log density of both distributions (panels: Nominal Distribution and Failure Distribution).

Most of the probability mass for the failure distribution is concentrated in the central corners of the two modes, and a good proposal distribution should assign high likelihood in those areas. If we only use one multivariate Gaussian proposal distribution, we need to select a wide distribution to ensure that it covers both failure modes (left). We can improve performance by selecting multiple proposal distributions that together cover both failure modes (right). The plot in the caption compares the performance of importance sampling (IS) to multiple importance sampling with the two different weighting schemes. The shaded region represents the 90% empirical confidence bounds on the error. The DM-MIS weighting scheme results in better performance than the s-MIS weighting scheme.
7.3 Adaptive Importance Sampling

The cross entropy method adapts the proposal distribution by minimizing the cross entropy between the proposal density and failure density.⁶ In these cases, the weight for a given sample τ drawn from the distribution q is

$$w = \frac{\mathbb{1}\{\tau \not\in \psi\}\, p(\tau)}{q(\tau)} \tag{7.15}$$

where p is the nominal trajectory distribution.

If failures are rare under the initial proposal distribution, it is possible that no samples will be failures, and the weights computed in equation (7.15) will all be zero. To address this challenge, the cross entropy algorithm iteratively solves a relaxed version of the problem that relies on an objective function f. Similar to the objective functions introduced in section 4.5, the objective function should assess how close a trajectory is to a failure.⁷ The objective value must be greater than zero for trajectories that are not failures and less than or equal to zero for failure trajectories. For systems with temporal logic specifications, we can use the robustness as the objective function.

⁶ Minimizing the cross entropy is equivalent to weighted maximum likelihood estimation for distributions in the natural exponential family. The natural exponential family includes many common distributions such as the Gaussian, geometric, exponential, categorical, and Beta distributions.

⁷ For example, an objective function for the aircraft collision avoidance problem might output the miss distance between the two aircraft.
We can rewrite the goal of the cross entropy method in terms of the objective function as finding the set of parameters that minimizes the cross entropy between the proposal distribution and p(τ | f(τ) ≤ 0). For systems with rare failure events, we gradually make progress toward this goal by solving a series of relaxed problems where we instead minimize the cross entropy between the proposal and p(τ | f(τ) ≤ γ) for a given threshold γ > 0. The weights used in maximum likelihood estimation for the relaxed problem are

$$w = \frac{\mathbb{1}\{f(\tau) \le \gamma\}\, p(\tau)}{q(\tau)} \tag{7.16}$$

At each iteration, we select the threshold γ based on our current set of samples to ensure that a fraction of the weights will be nonzero (figure 7.4).

Figure 7.4. Threshold selection for a two-dimensional Gaussian problem with two failure modes. The red shaded region shows the failure region. None of the current samples overlap with the failure region, so we relax the problem by expanding to the blue region that contains a desired fraction of the samples. The blue samples are the top 10% of samples with the lowest objective values.

Algorithm 7.5 implements the cross entropy method. At each iteration, we draw samples from the current proposal distribution and compute their objective values. We then select the threshold γ as the highest objective value from a set of elite samples. The elite samples are the m_elite samples with the lowest objective values. Since our ultimate goal is to approach the failure distribution, we ensure that the threshold does not become negative by clipping it at zero. Given this threshold, we compute the weights using equation (7.16) and fit a new proposal distribution to the samples. After repeating this process for a fixed number of iterations, algorithm 7.5 performs importance sampling (algorithm 7.3) with the final proposal distribution to estimate the probability of failure.
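A minimal sketch of one version of this loop for a scalar problem, assuming Distributions.jl and a Normal proposal family; the objective, sample sizes, and fitting step are illustrative rather than the book's algorithm 7.5:

using Distributions, Statistics

# Cross entropy method: repeatedly relax the threshold γ, weight samples per
# equation (7.16), and refit the proposal by weighted maximum likelihood.
function cross_entropy(f, p, q, m, m_elite, k_max)
    for k in 1:k_max
        τs = rand(q, m)
        γ = max(0.0, sort(f.(τs))[m_elite])   # highest elite objective, clipped at 0
        ws = [f(τ) ≤ γ ? pdf(p, τ) / pdf(q, τ) : 0.0 for τ in τs]
        μ = sum(ws .* τs) / sum(ws)           # weighted MLE for a Normal proposal
        σ = sqrt(sum(ws .* (τs .- μ).^2) / sum(ws))
        q = Normal(μ, σ)
    end
    return q
end

# Toy usage: nominal Normal(0,1), failure when τ ≤ -2, objective f(τ) = τ + 2.
q = cross_entropy(τ -> τ + 2, Normal(0, 1), Normal(0, 1), 100, 10, 10)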
Some implementations instead produce an estimate using the samples produced across all iterations of the algorithm. They keep track of the proposal distribution used to generate the samples at each iteration and view the problem as an instance of MIS, producing an estimate using the weighting schemes from section 7.2.3. It is important to note in this case, however, that the proposal for each iteration depends on the previous proposal. Since the proposals are not independent from one another, the DM-MIS weighting scheme will no longer be unbiased.
The performance of the cross entropy algorithm is sensitive to the form of the proposal distribution. The algorithm may perform poorly if the proposal distribution is not expressive enough to adequately capture the shape of the failure distribution. This behavior is particularly apparent for complex systems with high-dimensional, multimodal failure distributions. For example, if we select a Gaussian proposal distribution for a system with two failure modes, the algorithm will struggle to find a proposal distribution that captures both failure modes (figure 7.5). One solution is to use a mixture of Gaussians for multimodal failure distributions (figure 7.7), but this approach requires knowing the number of failure modes in advance, which is often not possible in practice. In these cases, an adaptive MIS approach such as population Monte Carlo may perform better.

Figure 7.6. The cross entropy method for a one-dimensional Gaussian problem. The plot shows the Gaussian proposal distribution at each iteration of the algorithm, with darker distributions representing later iterations. The distributions start at the nominal distribution and gradually move toward the failure distribution (red).

Figure 7.7. Example of a Gaussian mixture model proposal distribution for a one-dimensional problem with two failure modes. The proposal distribution is a mixture of two Gaussians (blue) that approximates the multimodal failure distribution (red).
7.3.2 Population Monte Carlo
Population Monte Carlo (PMC) is an adaptive MIS algorithm that maintains a set, or population, of proposal distributions (algorithm 7.6).⁸ Figure 7.8 shows a single step of the algorithm. We begin with an initial population of m proposals that is spread across the space of proposal distributions. For example, we could use a set of multivariate Gaussian distributions with a fixed covariance and different means. It is important to ensure that the initial population is sufficiently diverse to capture all failure modes.

At each iteration, the algorithm draws a single sample from each proposal distribution in the population. It then computes a weight for each sample in the same way the weights are computed for the cross entropy method (equation (7.15)). Samples in regions of high likelihood under the failure distribution will receive higher weights. To adapt the proposal distributions, PMC uses the weights to perform a resampling step.

⁸ O. Cappé, A. Guillin, J.-M. Marin, and C. P. Robert, "Population Monte Carlo," Journal of Computational and Graphical Statistics, vol. 13, no. 4, pp. 907–929, 2004.
Figure 7.8. One iteration of the population Monte Carlo algorithm, with panels for the initial proposals, sampling, weighting, and resampling steps. For the weighting step, gray samples have zero weight, and the size of the blue samples is proportional to their weight.
In this step, we redraw m samples from the population
of samples with probability proportional to their weights. We then reconstruct the
population of proposal distributions using the resulting samples. For example,
if we are using proposals in the form of multivariate Gaussian distributions, we
could create new proposals with the same fixed covariance and means centered
at each sample.
Over time, the population of proposal distributions should cover high likeli-
hood regions of the failure distribution. After a fixed number of iterations, we
perform MIS using the final population to estimate the probability of failure. We
can use either of the weighting schemes from section 7.2.3 to produce the estimate.
Similar to the cross entropy method, we could instead use the samples produced
during all iterations of the algorithm to estimate the probability of failure, noting
that the estimate in this case may no longer be unbiased.
Using multiple proposal distributions allows us to represent complex, multi-
modal failure distributions. However, the performance of PMC is still dependent
on the number of proposal distributions and their ability to cover the space of
possible proposals. If the number of proposal distributions is too small or the ini-
tial proposals are not sufficiently diverse, the algorithm may miss failure modes
and produce an inaccurate estimate. Furthermore, the stochastic nature of the
resampling procedure can lead to a loss of diversity in the proposal distributions
over time. For example, the proposals may collapse to a single failure mode or a
subset of the failure modes.
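A minimal sketch of one PMC iteration for a scalar problem, assuming Distributions.jl and StatsBase.jl; the Normal proposal family and resampling scheme below are illustrative, not the book's algorithm 7.6:

using Distributions, StatsBase

# One PMC step: sample once from each proposal, weight per equation (7.15),
# resample by weight, and return the new proposal means.
function pmc_step(p̄, means, σ, m)
    qs = [Normal(μ, σ) for μ in means]
    τs = [rand(q) for q in qs]                          # one sample per proposal
    ws = [p̄(τ) / pdf(q, τ) for (τ, q) in zip(τs, qs)]   # weights as in equation (7.15)
    sum(ws) > 0 || return means                         # no failures found this round
    return sample(τs, Weights(ws), m)                   # resampled means for new proposals
end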
7.4 Sequential Monte Carlo

The sampling, weighting, and resampling components of algorithm 7.6 form the basis for a more general framework used in the field of Bayesian inference called sequential Monte Carlo (SMC).⁹ In SMC, we start with samples from the nominal trajectory distribution and gradually adapt these samples to move toward the failure distribution. We then use the path of each sample to estimate the probability of failure.

One way to adapt the samples in SMC is to move them through a sequence of intermediate distributions that gradually transition from the nominal distribution to the failure distribution. Specifically, we create a sequence of distributions that begins at the nominal trajectory distribution and ends at the failure distribution. The first method for constructing this sequence uses the smoothing technique introduced in section 6.3.2. We can move from the nominal distribution to the failure distribution by gradually decreasing the value of the standard deviation ε in the smoothed density.¹⁰ The second method uses the same thresholding technique used in the cross entropy method. The intermediate distributions take the form p(τ | f(τ) ≤ γ), where f(τ) is the objective function and γ is a threshold. We move from the nominal distribution to the failure distribution by gradually decreasing the value of γ.

At each iteration of SMC, our goal is to transition samples from the current distribution g_ℓ to the next distribution in the sequence g_{ℓ+1}.

⁹ SMC is also known as particle filtering in the context of state estimation. M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A Tutorial on Particle Filters for Online Nonlinear/non-Gaussian Bayesian Tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

¹⁰ A similar technique is the exponential tilting barrier presented in A. Sinha, M. O'Kelly, R. Tedrake, and J. C. Duchi, "Neural Bridge Sampling for Evaluating Safety-Critical Autonomous Systems," Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6402–6416, 2020.
The accuracy of the estimator in equation (7.18) depends on how well the samples at each iteration represent the corresponding intermediate distribution. If the samples are not representative, the weights will be small, and the estimator will be inaccurate. However, we may require a large number of MCMC steps to transition samples from one distribution to the next, especially for samples that are unlikely under the next distribution. One technique used to address this challenge is to resample the trajectories based on their importance weights.¹² This step is similar to the resampling step in PMC and tends to result in better coverage of the intermediate distributions (see example 7.6). After resampling, we reset the weights to the mean of the weights before resampling to ensure that the estimator in equation (7.18) remains accurate.

Algorithm 7.7 implements SMC given a nominal trajectory distribution and a set of intermediate distributions. At each iteration, it perturbs the current set of samples to represent the next distribution in the sequence using MCMC. Example 7.7 provides an implementation of this step for the inverted pendulum problem. The algorithm then updates the importance weights and performs the resampling step. Finally, it returns an estimate of the probability of failure based on equation (7.18).

¹² P. Del Moral, A. Doucet, and A. Jasra, "Sequential Monte Carlo Samplers," Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 68, no. 3, pp. 411–436, 2006.
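A compact sketch of this loop for unnormalized intermediate densities g₁, …, g_k, where g₁ is the normalized nominal density and g_k is the unnormalized failure density, so that the product of mean incremental weights estimates the probability of failure; perturb stands in for an MCMC move such as the one in example 7.7, and StatsBase.jl provides weighted resampling:

using Statistics, StatsBase

function smc(perturb, τs, gs)
    p̂ = 1.0
    for ℓ in 1:length(gs)-1
        ws = [gs[ℓ+1](τ) / gs[ℓ](τ) for τ in τs]    # incremental importance weights
        p̂ *= mean(ws)                               # ratio of normalizing constants
        τs = sample(τs, Weights(ws), length(τs))    # resample by weight
        τs = perturb(τs, gs[ℓ+1])                   # MCMC move toward next distribution
    end
    return p̂, τs
end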
Example 7.6. The benefit of resampling in SMC. The first set of plots illustrates the resampling step, and the second set of plots shows the improvement in the samples at the next iteration after performing resampling.

Consider the scenario shown below, in which we want to transition samples from the blue distribution to the purple distribution using 10 MCMC steps per sample. The plots below illustrate the weighting and resampling steps. The plot in the middle shows the weights of the samples, with darker points having higher weights. The plot on the right shows the samples after resampling according to these weights. After resampling, the samples are more representative of the purple distribution.

On the next iteration, we perform MCMC starting at these samples with the purple distribution as the target to complete the transition. The plots below show the result of this step with and without resampling. The results without resampling start the MCMC chains at the blue samples shown above. The resampling step results in a set of samples that better represents the target distribution.
Example 7.7. Application of SMC to the inverted pendulum problem. The plots show the samples from the intermediate smoothed failure distributions, plotted as pendulum angle over time.

To estimate the probability of failure for the inverted pendulum system using SMC, we implement the following function that uses 10 MCMC steps to transition samples between intermediate distributions:

function perturb(samples, ḡ)
    function inverted_pendulum_kernel(τ; Σ=0.05^2 * I)
        μs, Σs = [step.x.xo for step in τ], [Σ for step in τ]
        return PendulumTrajectoryDistribution(τ[1].s, Σ, μs, Σs)
    end
    k_max, m_burnin, m_skip, new_samples = 10, 1, 1, []
    for sample in samples
        alg = MCMCSampling(ḡ, inverted_pendulum_kernel, sample,
                           k_max, m_burnin, m_skip)
        mcmc_samples = sample_failures(alg, inverted_pendulum, ψ)
        push!(new_samples, mcmc_samples[end])
    end
    return new_samples
end

Using 1,000 samples per iteration, SMC estimates the probability of failure to be approximately 0.0001. The direct estimate for the probability of failure based on one million simulations is approximately 0.0005.
7.5 Ratio of Normalizing Constants

Suppose we have two unnormalized densities ḡ₁(τ) and ḡ₂(τ) with normalizing constants z₁ = ∫ ḡ₁(τ) dτ and z₂ = ∫ ḡ₂(τ) dτ, and our goal is to estimate the ratio of the normalizing constants z₁/z₂ using samples from g₂.

First, we rewrite z₁ in terms of an expectation over g₂:

$$z_1 = \int \bar{g}_1(\tau)\, d\tau \tag{7.19}$$
$$= \int \bar{g}_1(\tau) \frac{g_2(\tau)}{g_2(\tau)}\, d\tau \tag{7.20}$$
$$= \int g_2(\tau) \frac{\bar{g}_1(\tau)}{\bar{g}_2(\tau)/z_2}\, d\tau \tag{7.21}$$
$$= z_2 \int g_2(\tau) \frac{\bar{g}_1(\tau)}{\bar{g}_2(\tau)}\, d\tau \tag{7.22}$$
$$= z_2\, \mathbb{E}_{\tau \sim g_2(\cdot)}\!\left[\frac{\bar{g}_1(\tau)}{\bar{g}_2(\tau)}\right] \tag{7.23}$$

Dividing both sides of equation (7.23) by z₂ gives us the ratio of the normalizing constants, which we can approximate using m samples from g₂:

$$\frac{z_1}{z_2} = \mathbb{E}_{\tau \sim g_2(\cdot)}\!\left[\frac{\bar{g}_1(\tau)}{\bar{g}_2(\tau)}\right] \approx \frac{1}{m} \sum_{i=1}^{m} \frac{\bar{g}_1(\tau_i)}{\bar{g}_2(\tau_i)} \tag{7.24}$$
where τᵢ ∼ g₂(·) and g₂(τ) > 0 whenever g₁(τ) > 0. Note that the estimator in equation (7.24) only requires evaluating the unnormalized densities ḡ₁(τ) and ḡ₂(τ). Since p_fail is the normalizing constant of the failure distribution, we can use equation (7.24) to estimate the probability of failure by setting ḡ₁(τ) equal to the unnormalized failure density and ḡ₂(τ) equal to any normalized proposal density q(τ). In fact, these choices of ḡ₁(τ) and ḡ₂(τ) cause equation (7.24) to reduce to the importance sampling estimator in equation (7.8) (see exercises 7.2 and 7.3).¹⁵

¹⁵ We could also use equation (7.24) to estimate the reciprocal of the probability of failure using samples from the failure distribution by setting ḡ₂(τ) equal to the unnormalized failure density and ḡ₁(τ) equal to any normalized density whose support is contained within the support of the failure distribution. However, selecting ḡ₁(τ) to satisfy this condition is often difficult in practice and can lead to estimators with infinite variance. This technique is called reciprocal importance sampling. In general, this estimator should not be used for failure probability estimation.

If g₁ and g₂ have little overlap in terms of probability mass, the estimator in equation (7.24) may perform poorly. One technique to improve performance is called umbrella sampling (also known as ratio importance sampling). Umbrella sampling introduces a third density, called an umbrella density, that has significant overlap with both g₁ and g₂. We use this density to estimate the ratio of normalizing constants by applying equation (7.24) twice:

$$\frac{z_1}{z_2} = \frac{z_1/z_u}{z_2/z_u} = \frac{\mathbb{E}_{\tau \sim g_u(\cdot)}\!\left[\bar{g}_1(\tau)/\bar{g}_u(\tau)\right]}{\mathbb{E}_{\tau \sim g_u(\cdot)}\!\left[\bar{g}_2(\tau)/\bar{g}_u(\tau)\right]} \approx \frac{\frac{1}{m} \sum_{i=1}^{m} \frac{\bar{g}_1(\tau_i)}{\bar{g}_u(\tau_i)}}{\frac{1}{m} \sum_{i=1}^{m} \frac{\bar{g}_2(\tau_i)}{\bar{g}_u(\tau_i)}} \tag{7.25}$$
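Both the basic ratio estimator and the umbrella estimator are one-liners in code; a sketch assuming generic callable unnormalized densities (illustrative, not the book's implementation):

using Statistics

# Equation (7.24): estimate z₁/z₂ from samples τs drawn from g₂.
ratio_estimate(ḡ₁, ḡ₂, τs) = mean(ḡ₁(τ) / ḡ₂(τ) for τ in τs)

# Equation (7.25): estimate z₁/z₂ from samples τs drawn from an umbrella
# density g_u that overlaps both g₁ and g₂.
umbrella_estimate(ḡ₁, ḡ₂, ḡu, τs) =
    mean(ḡ₁(τ) / ḡu(τ) for τ in τs) / mean(ḡ₂(τ) / ḡu(τ) for τ in τs)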
The optimal umbrella density is

$$\bar{g}_u^*(\tau) \propto \left| \bar{g}_1(\tau) - \frac{z_1}{z_2}\, \bar{g}_2(\tau) \right| \tag{7.26}$$

Similar to the optimal proposal for importance sampling, the optimal umbrella density is expressed in terms of the quantity we are trying to estimate, so we cannot compute it exactly. In general, we want to select an umbrella density that is as close as possible to this density.
Another technique to estimate the ratio of normalizing constants when g₁ and g₂ have little overlap is called bridge sampling. Similar to umbrella sampling, bridge sampling introduces a third density called a bridge density. However, instead of using samples from this density to estimate the ratio of normalizing constants, bridge sampling uses samples from both g₁ and g₂. Assuming we produce m₁ samples from g₁ and m₂ samples from g₂, we again apply equation (7.24) twice to obtain the bridge sampling estimator:

$$\frac{z_1}{z_2} = \frac{z_b/z_2}{z_b/z_1} = \frac{\mathbb{E}_{\tau \sim g_2(\cdot)}\!\left[\bar{g}_b(\tau)/\bar{g}_2(\tau)\right]}{\mathbb{E}_{\tau \sim g_1(\cdot)}\!\left[\bar{g}_b(\tau)/\bar{g}_1(\tau)\right]} \approx \frac{\frac{1}{m_2} \sum_{j=1}^{m_2} \frac{\bar{g}_b(\tau_j)}{\bar{g}_2(\tau_j)}}{\frac{1}{m_1} \sum_{i=1}^{m_1} \frac{\bar{g}_b(\tau_i)}{\bar{g}_1(\tau_i)}} \tag{7.27}$$

The optimal bridge density is

$$\bar{g}_b^*(\tau) \propto \frac{\bar{g}_1(\tau)\, \bar{g}_2(\tau)}{m_1\, \bar{g}_1(\tau) + m_2 \frac{z_1}{z_2}\, \bar{g}_2(\tau)} \tag{7.28}$$

which is again written in terms of the quantity we are trying to estimate. Given samples from both g₁ and g₂, we can use a simple iterative procedure to estimate the optimal bridge density (algorithm 7.8). At each iteration, we apply equation (7.27) using the current bridge density to estimate the ratio of normalizing constants. We then plug this ratio into equation (7.28) to obtain a new bridge density. We repeat this process for a fixed number of iterations.
While umbrella sampling and bridge sampling both introduce a third density to improve efficiency, they have different properties. For example, umbrella sampling only requires samples from one density, while bridge sampling requires samples from two different densities. Furthermore, the optimal umbrella density and the optimal bridge density are very different (see figure 7.11). The optimal umbrella density covers regions of high likelihood for both distributions, while the optimal bridge density bridges the gap between the two distributions.
Algorithm 7.8. Algorithm for estimating the optimal bridge density ḡb using samples from ḡ₁ and ḡ₂. We iteratively apply equation (7.27) to estimate the ratio of normalizing constants and use this ratio to update the bridge density using equation (7.28).

function bridge_sampling_estimator(g₁τs, ḡ₁, g₂τs, ḡ₂, ḡb)
    ḡ₁s, ḡ₂s = ḡ₁.(g₁τs), ḡ₂.(g₂τs)
    ḡb₁s, ḡb₂s = ḡb.(g₁τs), ḡb.(g₂τs)
    return mean(ḡb₂s ./ ḡ₂s) / mean(ḡb₁s ./ ḡ₁s)
end

function optimal_bridge(g₁τs, ḡ₁, g₂τs, ḡ₂, k_max)
    ratio = 1.0
    m₁, m₂ = length(g₁τs), length(g₂τs)
    ḡb(τ) = (ḡ₁(τ) * ḡ₂(τ)) / (m₁ * ḡ₁(τ) + m₂ * ratio * ḡ₂(τ))
    for k in 1:k_max
        ratio = bridge_sampling_estimator(g₁τs, ḡ₁, g₂τs, ḡ₂, ḡb)
    end
    return ḡb
end
(Figure 7.11: comparison of the optimal umbrella density ḡ*ᵤ(τ) and the optimal bridge density for two densities with little overlap.)
When the proposal density is known only up to a normalizing constant q̄, we can use the self-normalized importance sampling (self-IS) estimator

$$\hat{p}_{\text{fail}} = \frac{\sum_{i=1}^{m} w_i\, \mathbb{1}\{\tau_i \not\in \psi\}}{\sum_{i=1}^{m} w_i} \tag{7.29}$$

where wᵢ = p(τᵢ)/q̄(τᵢ) and τᵢ ∼ q(·). This estimator (algorithm 7.9) is similar to the estimator in equation (7.9) for normalized proposal distributions, with the extra step of dividing the unnormalized importance weights by their sum.

The optimal proposal for self-IS is different from the optimal proposal for importance sampling. Based on equation (7.26), the optimal proposal for self-IS is

$$q^*(\tau) \propto p(\tau)\, \left| \mathbb{1}\{\tau \not\in \psi\} - p_{\text{fail}} \right| \tag{7.30}$$

Sampling from this density should result in half of the samples coming from the failure distribution and half coming from the success distribution. The optimal proposal for IS, on the other hand, is the failure distribution itself. Figure 7.12 shows the optimal proposal distribution for self-IS on a simple Gaussian system.

Figure 7.12. The optimal proposal q*(τ) for self-normalized importance sampling for a simple Gaussian problem with a failure threshold of −1, shown with the nominal density p(τ) and the failure density p̄(τ | τ ∉ ψ).

In practice, we can plug a guess for p_fail into equation (7.30) to obtain a proposal distribution that is close to the optimal proposal. However, drawing samples from this proposal is often difficult in practice, especially for systems with rare failure events and multiple failure modes (see example 7.8). Furthermore, the performance of the algorithm tends to be sensitive to incorrect guesses for p_fail when creating the proposal distribution. Bridge sampling, which we discuss in the next section, is less sensitive to these choices.
Example 7.8. The challenges associated with sampling from the optimal self-IS proposal density. The plots show the proposal distribution for three different failure probabilities along with histograms of samples drawn from these distributions using MCMC. As the probability of failure decreases, the distributions become more difficult to sample from accurately.

Suppose we want to use self-IS to estimate the probability of failure for the simple Gaussian system. We know that the optimal proposal is of the form

$$q^*(\tau) \propto p(\tau)\, \left| \mathbb{1}\{\tau \not\in \psi\} - \alpha \right|$$

where α is our guess for the probability of failure. The plots below show the proposal distribution for three different values of α (10⁻², 10⁻⁷, and 10⁻¹²) along with histograms of samples drawn from these distributions using MCMC. For each distribution, we use 5,100 MCMC steps with a burn-in of 100 steps, keeping every 10th sample.
where we draw m1 samples from g`+1 (·) and m2 samples from g` (·). The in-
termediate distributions in the chain should be chosen such that the ratio of
normalizing constants between two consecutive distributions is easy to estimate.
In other words, consecutive intermediate distributions should have significant
overlap with each other. For example, we can create the intermediate distributions
using either of the two methods in figure 7.9.¹⁷ (¹⁷ If we use the thresholding technique, the algorithm reduces to the multilevel splitting algorithm presented in section 7.6.)

Algorithm 7.10 implements bridge sampling estimation using a sequence of intermediate distributions. It begins by drawing samples from the nominal trajectory
distribution. At each iteration, it perturbs the samples to match the next interme-
diate distribution and estimates the optimal bridge density using algorithm 7.8.
It then applies equation (7.32) to compute the ratio of normalizing constants
between the two distributions. Finally, the algorithm applies equation (7.31) to
compute an estimate for the probability of failure.
Example 7.9. Proof that a bridge sampling estimator that sets $\bar{g}_1(\tau)$ to the unnormalized failure density and $\bar{g}_2(\tau)$ to the nominal trajectory distribution can perform no better than the direct estimator for estimating the probability of failure.

To estimate the probability of failure from equation (7.27), we set $\bar{g}_1(\tau)$ equal to the unnormalized failure density and $\bar{g}_2(\tau)$ equal to the density of the nominal trajectory distribution:

$$\hat{p}_\text{fail} \approx \frac{\frac{1}{m_2} \sum_{j=1}^{m_2} \bar{g}_b(\tau_j) / p(\tau_j)}{\frac{1}{m_1} \sum_{i=1}^{m_1} \bar{g}_b(\tau_i) / \left( \mathbb{1}\{\tau_i \notin \psi\}\, p(\tau_i) \right)}$$

The optimal bridge density is zero for all samples that are not failures. Since all the samples from the failure density will be failure samples, we have

$$\frac{1}{m_1} \sum_{i=1}^{m_1} \frac{\bar{g}_b(\tau_i)}{\mathbb{1}\{\tau_i \notin \psi\}\, p(\tau_i)} = \frac{1}{m_1} \sum_{i=1}^{m_1} \frac{p(\tau_i)}{p(\tau_i)} = 1$$
The perturb step in algorithm 7.10 produces samples from the next distribution
in the sequence and can be performed using the MCMC algorithms presented in
chapter 6. However, these distributions may be difficult to sample from, especially
as we get closer to the failure distribution. In practice, we can greatly increase
efficiency by using the samples from the previous distribution as a starting point
to produce samples from the next distribution. This process is similar to the
process used in SMC (algorithm 7.7), in which we weight and resample the
trajectories from the previous distribution before applying MCMC. The ability
to adapt samples from the previous distribution is another benefit of using a
sequence of intermediate distributions. Figure 7.13 shows the samples from the
intermediate distributions for the continuum world problem.
(Figure 7.13: samples from the intermediate distributions for ε = 0.1 and ε = 0.01, and the failure distribution $p(\tau \mid \tau \notin \psi)$.)
As long as the thresholds gradually decrease in a way that ensures that the condi-
tional probabilities remain large, we can efficiently estimate these intermediate
probabilities using direct estimation.
To ensure that the conditional probabilities remain large, it is common to select the thresholds adaptively.¹⁹ (¹⁹ F. Cérou and A. Guyader, "Adaptive Multilevel Splitting for Rare Event Analysis," Stochastic Analysis and Applications, vol. 25, no. 2, pp. 417–443, 2007.) Algorithm 7.11 implements adaptive multilevel splitting. Adaptive multilevel splitting begins by drawing samples from the nominal trajectory distribution. At each iteration, it computes the objective value for each sample and selects a threshold γ such that a fixed number of samples have objective values less than γ. It then uses this threshold and the current samples to estimate $p(f(\tau) \le \gamma_\ell \mid f(\tau) \le \gamma_{\ell-1})$.

The algorithm produces the next set of samples by perturbing the current samples to represent the distribution $p(\tau \mid f(\tau) \le \gamma_\ell)$. As with SMC and bridge
sampling, this step can be performed using the MCMC algorithms presented in
chapter 6. To improve the efficiency of the MCMC, we first resample by drawing
m samples uniformly from the elite samples.
To accurately estimate the probability of failure, the last iteration of the algo-
rithm must use a threshold of zero. Algorithm 7.11 iterates until the threshold
reaches zero, at which point all elite samples are failures. If we reach the maxi-
mum number of iterations before this criterion is met, the algorithm will force
the final threshold to be zero. However, if there are no failure samples in the final
iteration, the final conditional probability will be zero, causing the algorithm
to return an estimate of zero. Therefore, it is important to ensure that we allow
enough iterations for the algorithm to reach the final threshold.
Multilevel splitting is considered a nonparametric algorithm in that we estimate
the probability of failure without assuming a specific form for the conditional
distributions. This feature allows multilevel splitting to extend to systems with
complex, multimodal failure distributions. Furthermore, the adaptive nature
of algorithm 7.11 allows us to smoothly transition from the nominal trajectory
distribution to the failure distribution without specifying the intermediate dis-
tributions ahead of time. Figure 7.14 shows an example of adaptive multilevel
splitting applied to the inverted pendulum system, which has two modes in its
failure distribution.
7.7 Summary
• For systems with rare failure events, we can use importance sampling to estimate the probability of failure by sampling from a proposal distribution that assigns higher likelihood to failure trajectories.
7.8 Exercises
Exercise 7.1. The coefficient of variation of a random variable is defined as the ratio of
the standard deviation to the mean and is a measure of relative variability. Compute the
coefficient of variation for the estimator in equation (7.2). For a fixed sample size m, how
does the coefficient of variation change as pfail increases? For a fixed pfail , how does the
coefficient of variation change as the sample size m increases?
Solution: The estimator in equation (7.2) has mean $p_\text{fail}$ and variance $p_\text{fail}(1 - p_\text{fail})/m$, so its coefficient of variation is $\sqrt{(1 - p_\text{fail})/(m\, p_\text{fail})}$. For a fixed $m$, the coefficient of variation will decrease as the true probability of failure $p_\text{fail}$ increases. For a fixed $p_\text{fail}$, the coefficient of variation will decrease as the sample size $m$ increases. The plots here show an example of these relationships.

(Plots: coefficient of variation as a function of $p_\text{fail}$ for fixed $m$, and as a function of $m$ for fixed $p_\text{fail}$.)
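These relationships can also be checked numerically from the closed form derived above; a one-line sketch:

# Coefficient of variation of the direct estimator (closed form above).
cv(pfail, m) = sqrt((1 - pfail) / (m * pfail))
cv(0.1, 100)   # decreases as either pfail or m increases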
Exercise 7.2. Show that equation (7.24) reduces to equation (7.2) when $\bar{q}_1(\tau) = \mathbb{1}\{\tau \notin \psi\}\, p(\tau)$ and $\bar{q}_2(\tau) = p(\tau)$.

Solution: Since $\bar{q}_1$ is the unnormalized failure distribution, its normalizing constant is the probability of failure ($z_1 = p_\text{fail}$). Since $\bar{q}_2$ is the normalized nominal distribution, its normalizing constant is $z_2 = 1$. Plugging these values into equation (7.24) gives

$$\frac{p_\text{fail}}{1} = \mathbb{E}_{\tau \sim p(\cdot)}\left[ \frac{\mathbb{1}\{\tau \notin \psi\}\, p(\tau)}{p(\tau)} \right]$$
$$p_\text{fail} = \mathbb{E}_{\tau \sim p(\cdot)}\left[ \mathbb{1}\{\tau \notin \psi\} \right]$$

Given samples from the nominal distribution $\tau_i \sim p(\cdot)$, we can approximate the above equation as

$$\hat{p}_\text{fail} = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{\tau_i \notin \psi\}$$
Exercise 7.3. Show that equation (7.24) reduces to equation (7.8) when $\bar{q}_1(\tau) = \mathbb{1}\{\tau \notin \psi\}\, p(\tau)$ and $\bar{q}_2(\tau) = q(\tau)$.

Solution: Since $\bar{q}_1$ is the unnormalized failure distribution, its normalizing constant is the probability of failure ($z_1 = p_\text{fail}$). Since $\bar{q}_2$ is a normalized proposal distribution, its normalizing constant is $z_2 = 1$. Plugging these values into equation (7.24) gives

$$\frac{p_\text{fail}}{1} = \mathbb{E}_{\tau \sim q(\cdot)}\left[ \frac{\mathbb{1}\{\tau \notin \psi\}\, p(\tau)}{q(\tau)} \right]$$
$$p_\text{fail} = \mathbb{E}_{\tau \sim q(\cdot)}\left[ \mathbb{1}\{\tau \notin \psi\}\, \frac{p(\tau)}{q(\tau)} \right]$$
Given samples from the proposal distribution $\tau_i \sim q(\cdot)$, we can approximate the above equation as

$$\hat{p}_\text{fail} = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{\tau_i \notin \psi\}\, \frac{p(\tau_i)}{q(\tau_i)}$$
8 Reachability for Linear Systems
Forward reachability algorithms compute the set of states a system could reach
over a given time horizon. To perform this analysis, we need to make some
assumptions about the initial state and disturbances for the system. In the previous
chapters, we sampled initial states and disturbances from probability distributions, often with support over the entire real line. However, to perform reachability computations, we need to restrict the initial states and disturbances to bounded sets.¹ (¹ One way to convert a probability distribution to a bounded set is to use the support of the distribution. If the support of the distribution spans the entire real line, we can select a region that contains most of the probability mass.) We assume that the initial state comes from a bounded set $\mathcal{S}$ and that the disturbances at each time step come from a bounded set $\mathcal{X}$. The disturbance set
$\mathcal{X}$ is defined as follows:

$$\mathcal{X} = \left\{ \begin{bmatrix} x_a \\ x_o \\ x_s \end{bmatrix} \;\middle|\; x_a \in \mathcal{X}_a,\; x_o \in \mathcal{X}_o,\; x_s \in \mathcal{X}_s \right\} \qquad (8.1)$$

where $\mathcal{X}_a$, $\mathcal{X}_s$, and $\mathcal{X}_o$ are the disturbance sets for the agent, environment, and sensor, respectively.
Given an initial state $s$ and a disturbance trajectory $\mathbf{x}_{1:d} = (x_1, \ldots, x_d)$ with depth $d$, we can compute the state of the system at time step $d$ by performing a rollout (algorithm 4.6) and taking the final state. We denote this operation as $s_d = \text{Reach}(s, \mathbf{x}_{1:d})$. By performing this operation on various initial states and disturbances sampled from $\mathcal{S}$ and $\mathcal{X}$, we find a set of points in the state space that the system could reach at time step $d$. Figure 8.1 demonstrates this process on the mass-spring-damper system. (Figure 8.1. Samples from $\mathcal{R}_5$ for the mass-spring-damper system with initial position between −0.2 and 0.2 and initial velocity set to zero. The disturbance sets for the observation noise are bounded between −1 and 1. The gray points represent the initial states, the gray lines show the trajectories, and the blue points represent the states after 5 time steps.)

We define the reachable set at depth $d$ as the set of all states that the system could reach at time step $d$ given all possible initial states and disturbances. We write this set as

$$\mathcal{R}_d = \{ s_d \mid s_d = \text{Reach}(s, \mathbf{x}_{1:d}),\; s \in \mathcal{S},\; x_t \in \mathcal{X}_t,\; t \in 1:d \} \qquad (8.2)$$

where $\mathcal{X}_t$ represents the set of possible disturbances at time step $t$. We are often interested in the full set of states that the system might reach in a given time horizon rather than at a specific depth $d$. We denote this set as $\mathcal{R}_{1:h}$ and represent it as the union of the reachable sets at each depth up to the time horizon:

$$\mathcal{R}_{1:h} = \bigcup_{d=1}^{h} \mathcal{R}_d \qquad (8.3)$$
Figure 8.2 shows the reachable sets in R1:4 for the mass-spring-damper system.
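The sampling procedure behind figure 8.1 can be sketched in a few lines. The names rollout, sample_S, and sample_X are hypothetical stand-ins for the rollout of algorithm 4.6 and samplers over the bounded sets $\mathcal{S}$ and $\mathcal{X}$:

# Minimal sketch of sampling points from the reachable set R_d.
# rollout, sample_S, and sample_X are hypothetical stand-ins.
function sample_reachable(rollout, sample_S, sample_X, d, m)
    points = []
    for i in 1:m
        s = sample_S()                  # initial state s ∈ S
        𝐱 = [sample_X() for t in 1:d]   # disturbance trajectory x₁:d
        push!(points, rollout(s, 𝐱))    # final state s_d = Reach(s, x₁:d)
    end
    return points
end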
Computing reachable sets allows us to understand the behavior of a system
over time. For example, we can use reachable sets to determine if a system remains
within a safe region of the state space.² (² We could also determine if the system reaches a goal region in the state space. In this case, we would want to check if the reachable set is contained within the goal region.) We call the set of states that make up the unsafe region of the state space the avoid set and use this set to define a specification for the system (algorithm 8.1). If the reachable set intersects with the avoid set, the system violates the specification.
The reachability algorithms we discuss in this chapter apply to linear systems.
Linear systems are a class of systems for which the sensor, agent, and environment
Example 8.1. A common example of a linear system. The mass-spring-damper system consists of a mass m attached to a wall by a spring with spring constant k and a damper with damping coefficient c. The system is controlled by a force b applied to the mass. With enough observation noise, the system becomes unstable.

A common example of a linear system is a mass-spring-damper system, which can be used to model mechanical systems such as a car suspension or a bridge. The state of the system is the position (relative to the resting point) $p$ and velocity $v$ of the mass ($s = [p, v]$), the action is the force $b$ applied to the mass, and the observation is a noisy measurement of the state. The equations of motion for a mass-spring-damper system are

$$p' = p + v \Delta t$$
$$v' = v + \left( -\frac{k}{m} p - \frac{c}{m} v + \frac{1}{m} b \right) \Delta t$$

where $m$ is the mass, $k$ is the spring constant, $c$ is the damping coefficient, and $\Delta t$ is the discrete time step. Rewriting the dynamics in the form of equation (8.6), we have

$$T(s, a, x_s) = \begin{bmatrix} 1 & \Delta t \\ -\frac{k}{m}\Delta t & 1 - \frac{c}{m}\Delta t \end{bmatrix} \begin{bmatrix} p \\ v \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{1}{m}\Delta t \end{bmatrix} b + x_s = T_s s + T_a a + x_s$$

(Plots: simulated trajectories of the position $p$ over time for observation noise bounded by $-0.1 \le x_o \le 0.1$, $-1 \le x_o \le 1$, and $-10 \le x_o \le 10$.)
first focus on set propagation for the one-step reachability problem. We assume we are given a set of initial states $\mathcal{S}$ and a set of disturbances $\mathcal{X}$. Our goal is to compute the set of states $\mathcal{S}'$ that the system could reach at the next time step. Given a single initial state $s$ and disturbance $x$, we can compute the next state $s'$ by applying equations (8.4) to (8.6) sequentially:

$$s' = (T_s + T_a P_o O_s)s + T_a P_o x_o + T_a x_a + x_s \qquad (8.9)$$
We can compute the reachable set at the next time step by applying equation (8.9)
to S and X . To perform this computation, we must define the operations in
equation (8.9) as set operations. In particular, we must be able to apply a linear
transformation, or matrix multiplication, to a set and add two sets together.
The multiplication of a set $\mathcal{P}$ by a matrix $A$ is defined as

$$A\mathcal{P} = \{ A p \mid p \in \mathcal{P} \} \qquad (8.10)$$

where the result is the set of all points obtained by multiplying each point in $\mathcal{P}$ by $A$. The addition of two sets $\mathcal{P}$ and $\mathcal{Q}$ is defined as

$$\mathcal{P} \oplus \mathcal{Q} = \{ p + q \mid p \in \mathcal{P},\; q \in \mathcal{Q} \} \qquad (8.11)$$
where the result is the set of all points obtained by adding each point in $\mathcal{P}$ to each point in $\mathcal{Q}$. This operation is referred to as the Minkowski sum of two sets and is often denoted using the symbol $\oplus$.⁴ (⁴ The Minkowski sum is named after mathematician Hermann Minkowski (1864–1909).) Figure 8.3 shows these operations in two-dimensional space. As we will discuss in the next section, we can efficiently compute linear transformations and Minkowski sums for many common set types such as hyperrectangles and polytopes.⁵ (⁵ The LazySets.jl package in Julia provides implementations of these operations for many common sets.)

With these definitions in place, we can rewrite equation (8.9) using set operations as

$$\mathcal{S}' = (T_s + T_a P_o O_s)\mathcal{S} \oplus T_a P_o \mathcal{X}_o \oplus T_a \mathcal{X}_a \oplus \mathcal{X}_s \qquad (8.12)$$

where $\mathcal{S}'$ is the one-step reachable set. It is important that we simplify the system dynamics into the form of equation (8.9) before applying set operations. If we apply the equations without simplification, we may encounter a phenomenon called the dependency effect, which occurs when a variable appears more than once in a formula. Set operations fail to model this dependency, leading to conservative reachable sets (see example 8.2). Algorithm 8.2 implements equation (8.12).
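As a concrete illustration of these set operations, here is a minimal sketch using the LazySets.jl package mentioned in the margin note; the matrices and bounds are arbitrary examples rather than values from the text:

using LazySets

# Arbitrary example sets and dynamics matrix.
S  = Hyperrectangle(low=[-0.2, 0.0], high=[0.2, 0.0])
Xs = Hyperrectangle(low=[-0.01, -0.01], high=[0.01, 0.01])
A  = [1.0 0.1; -0.1 0.9]

# Linear transformation (equation 8.10) followed by a Minkowski sum
# (equation 8.11), the two operations needed for equation (8.12).
S′ = minkowski_sum(linear_map(A, S), Xs)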
To compute reachable sets over a given time horizon using set propagation techniques, we rely on the fact that the reachable set at time step $d$ is a function of the reachable set at time step $d-1$. Specifically, we can compute the reachable set at time step $d$ by applying equation (8.12) to the reachable set at time step $d-1$:

$$\mathcal{R}_d = (T_s + T_a P_o O_s)\mathcal{R}_{d-1} \oplus T_a P_o \mathcal{X}_o \oplus T_a \mathcal{X}_a \oplus \mathcal{X}_s$$

Algorithm 8.3 implements this recursive algorithm for computing the reachable set at each time step. The algorithm terminates when it reaches the desired time horizon $h$ and returns $\mathcal{R}_{1:h}$.
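A minimal sketch of this recursion follows, again using LazySets.jl operations. The matrices Ts, Ta, Po, Os and the disturbance sets are assumed to be given in the form of equation (8.12); this illustrates the idea behind algorithm 8.3 rather than reproducing it:

using LazySets

# Minimal sketch of recursive set propagation.
# Ts, Ta, Po, Os are matrices; S, Xo, Xa, Xs are LazySets sets.
function propagate(S, Xo, Xa, Xs, Ts, Ta, Po, Os, h)
    M, R = Ts + Ta * Po * Os, [S]
    # The disturbance contribution is the same at every step, so we
    # precompute it once.
    X = minkowski_sum(minkowski_sum(linear_map(Ta * Po, Xo),
                                    linear_map(Ta, Xa)), Xs)
    for d in 2:h
        push!(R, minkowski_sum(linear_map(M, R[end]), X))  # equation (8.12)
    end
    return R   # R[d] approximates R_d; their union gives R_{1:h}
end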
In addition to gaining insight into the behavior of a system, we can use reach-
able sets to verify that a system satisfies a given specification. For a given spec-
ification, we want to ensure that the reachable set does not intersect with its
Example 8.2. Example of the dependency effect on a simple system.

Consider a simple system with the following component models:

$$O(s, x_o) = s$$
$$\pi(o, x_a) = -I o$$
$$T(s, a, x_s) = s + a$$

where the state, action, and observation are two-dimensional and $I$ is the identity matrix. Suppose we want to compute the one-step reachable set $\mathcal{S}'$ when the initial set is a square centered at the origin with side length 1. If we apply the sensor, agent, and environment models on the initial set without simplification, we get $\mathcal{O} = \mathcal{S}$, $\mathcal{A} = -I\mathcal{O} = -I\mathcal{S}$, and $\mathcal{S}' = \mathcal{S} \oplus \mathcal{A} = \mathcal{S} \oplus (-I\mathcal{S})$. The resulting set $\mathcal{S} \oplus (-I\mathcal{S})$ is a square with side length 2 centered at the origin. However, if we first simplify before switching to set operations, we get that $s' = s - s = 0$. Thus, the true reachable set contains only the origin.
This mismatch is due to an effect called the dependency effect, which leads
to conservative reachable sets. Because applying the set operations in order
does not account for the fact that the action depends on the state, it considers
worst-case behavior. For this reason, it is important to simplify the system
models into the form of equation (8.9) before applying set operations to avoid
unnecessary conservativeness. While this simplification is always possible
for linear systems, it is not always possible for the nonlinear systems we
discuss in the next chapter.
Example 8.3. Computing the reachable sets for the mass-spring-damper system over a time horizon of 20 steps. The reachable sets switch from light blue to dark blue over time.

Suppose we want to compute $\mathcal{R}_{1:20}$ for the mass-spring-damper system with initial position between −0.2 and 0.2 and initial velocity set to zero. We assume the observation noise is bounded between −0.2 and 0.2. To perform reachability, we must implement the following functions for the system:

𝒮₁(env::MassSpringDamper) = Hyperrectangle(low=[-0.2,0], high=[0.2,0])
function disturbance_set(sys)
    Do = sys.sensor.Do
    low = [support(d).lb for d in Do.v]
    high = [support(d).ub for d in Do.v]
    # Assumed completion of the truncated listing: return the bounds
    # as a hyperrectangle.
    return Hyperrectangle(low=low, high=high)
end
(Figure 8.4. Reachable sets (bottom row) for the mass-spring-damper system with varying levels of observation noise, $-0.1 \le x_o \le 0.1$, $-1.0 \le x_o \le 1.0$, and $-2.5 \le x_o \le 2.5$, compared to samples from a finite number of trajectories.)
Specifically, if at any point in algorithm 8.3 we find that $\mathcal{R}_d \subseteq \mathcal{R}_{d-1}$ (figure 8.5), we can conclude that $\mathcal{R}_d$ is an invariant set, meaning that the system will stay within this set indefinitely.⁶ (⁶ …this property for convex sets. It is also the case that if $\mathcal{R}_d \subseteq \mathcal{R}_{1:d-1}$, then $\mathcal{R}_{1:d}$ is an invariant set. However, this property is generally more difficult to check since it requires checking whether a set is a subset of a nonconvex set.)
To ensure that algorithms 8.2 to 8.4 are tractable, we must select set representations
that are computationally efficient. Desirable properties include:
• Finite representations: We should be able to specify the points that are contained
in the set without needing to enumerate all of them.
• Closure under set operations: A set representation is closed under a particular set
operation if applying the operation results in a set of the same type.
In this chapter, we will focus on convex set representations, which tend to have these properties.⁷ (⁷ Some nonconvex sets can also be efficiently represented and manipulated. A detailed overview is provided in M. Althoff, G. Frehse, and A. Girard, "Set Propagation Techniques for Reachability Analysis," Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 369–395, 2021.) A convex set is a set for which a line drawn between any two points in the set is contained entirely within the set. Mathematically, a set $\mathcal{P}$ is convex if we have

$$\alpha p + (1 - \alpha) q \in \mathcal{P} \qquad (8.15)$$

for all $p, q \in \mathcal{P}$ and $\alpha \in [0, 1]$. Figure 8.6 illustrates this property. The rest of this section discusses a common convex set representation called polytopes.
8.3.1 Polytopes
A polytope is defined as the bounded intersection of a set of linear inequalities.⁸ (⁸ We can also define convex sets such as ellipsoids using nonlinear inequalities. O. Maler, "Computing Reachable Sets: An Introduction," French National Center of Scientific Research, pp. 1–8, 2008.) A linear inequality has the form $a^\top x \le b$, where $a$ is a vector of coefficients, $x$ is a vector of variables, and $b$ is a scalar. We refer to the set of points that satisfy a given linear inequality as a half space. A polyhedron is the intersection of a finite number of half spaces. If the polyhedron is bounded, we call it a polytope. Figure 8.7 illustrates these concepts in two dimensions.
such that $\sum_{i=1}^{n} \lambda_i = 1$ and $\lambda_i \ge 0$ for all $i$. Intuitively, the convex hull of a set of points is the smallest convex set that contains all the points (figure 8.8).

It is always possible to convert between the two polytope representations; however, the calculation is nontrivial.⁹ (⁹ A detailed overview is provided in G. M. Ziegler, Lectures on Polytopes. Springer Science & Business Media, 2012, vol. 152. In Julia, LazySets.jl provides functionality to convert between the two representations.) Each representation has different advantages. For example, $\mathcal{H}$-polytopes are more efficient for checking whether a point belongs to the set because we can simply check if it satisfies all the linear inequalities. In contrast, $\mathcal{V}$-polytopes are more efficient for set operations such as linear transformations. To compute a linear transformation of a polytope represented as a $\mathcal{V}$-polytope, we can apply the transformation to each vertex to obtain the vertices of the transformed polytope.
The Minkowski sum of two $\mathcal{V}$-polytopes is

$$\mathcal{P}_1 \oplus \mathcal{P}_2 = \text{conv}(\{ v_1 + v_2 \mid v_1 \in \mathcal{V}_1,\; v_2 \in \mathcal{V}_2 \}) \qquad (8.17)$$
where the candidate vertices are the sums of all pairs of vertices from the two polytopes. To determine which candidates are actual vertices, we must determine which candidate vertices are on the boundary of the convex hull. Figure 8.9 illustrates this process.
We can use these results to reason about the complexity of algorithm 8.3 if we were to represent our sets as polytopes. We apply equation (8.12) at each iteration, which involves four linear transformations and three Minkowski sums. The number of candidate vertices resulting from computing the one-step reachable set using equation (8.12) is $|\mathcal{S}_1||\mathcal{X}_o||\mathcal{X}_a||\mathcal{X}_s|$, where $|\mathcal{P}|$ represents the number of vertices in polytope $\mathcal{P}$. The number of candidate vertices for the reachable set at depth $d$ is then $|\mathcal{S}_1|(|\mathcal{X}_o||\mathcal{X}_a||\mathcal{X}_s|)^d$. We can prune the candidate vertices that are not actual vertices by computing the convex hull of the candidate vertices, but this operation can be expensive.¹⁰ (¹⁰ The most efficient algorithms for computing the vertices of the convex hull of a set of points have a complexity of $O(mv)$, where $m$ is the number of candidate vertices and $v$ is the number of actual vertices. In general, the number of actual vertices grows superlinearly. For more details, see R. Seidel, "Convex Hull Computations," in Handbook of Discrete and Computational Geometry, Chapman and Hall, 2017, pp. 687–703.) Therefore, the exponential growth in the number of candidate vertices creates tractability challenges for high-dimensional systems with long time horizons.¹¹ (¹¹ Other polytope representations such as the Z-representation and M-representation perform Minkowski sums more efficiently. More details are provided in S. Sigl and M. Althoff, "M-Representation of Polytopes," arXiv:2303.05173, 2023.)

8.3.2 Zonotopes

A zonotope is a special type of polytope that avoids the exponential growth in candidate vertices for Minkowski sums. It is defined as the Minkowski sum of a set of line segments centered at a point $c$:

$$\mathcal{Z} = \left\{ c + \sum_{i=1}^{m} \alpha_i g_i \;\middle|\; \alpha_i \in [-1, 1] \right\} \qquad (8.18)$$

where $g_{1:m}$ are referred to as the generators of the zonotope.¹² (¹² Zonotopes can also be viewed as linear transformations of the unit hypercube.) We represent zonotopes by a center point and a list of generators:

$$\mathcal{Z} = (c, \langle g_{1:m} \rangle) \qquad (8.19)$$
To compute the Minkowski sum of two zonotopes, we sum the centers and concatenate the generators:

$$\mathcal{Z} \oplus \mathcal{Z}' = (c + c',\; \langle g_{1:m},\, g'_{1:m'} \rangle) \qquad (8.21)$$

Note that the number of generators in the resulting zonotope grows linearly with the number of generators in each zonotope. Therefore, if we represent our sets as zonotopes, the number of generators for the reachable set at depth $d$ is $|\mathcal{S}_1| + d(|\mathcal{X}_o| + |\mathcal{X}_a| + |\mathcal{X}_s|)$, where $|\cdot|$ here counts generators. This linear growth represents a significant improvement over the exponential growth in candidate vertices for generic polytopes.
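The following is a minimal sketch of these zonotope operations on a bare center-and-generators representation; the Zono struct is a hypothetical stand-in (LazySets.jl provides a full Zonotope implementation):

# A bare center-and-generators zonotope (hypothetical stand-in).
struct Zono
    c::Vector{Float64}           # center
    G::Vector{Vector{Float64}}   # generators g₁:m
end

# Minkowski sum (equation 8.21): add centers, concatenate generators.
minkowski_sum(Z::Zono, Z′::Zono) = Zono(Z.c + Z′.c, vcat(Z.G, Z′.G))

# Linear map: transform the center and each generator.
linear_map(A::Matrix{Float64}, Z::Zono) = Zono(A * Z.c, [A * g for g in Z.G])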
8.3.3 Hyperrectangles

A hyperrectangle is a generalization of a rectangle to higher dimensions (figure 8.12). It is a special type of zonotope in which the generators are aligned with the axes. We may also work with linear transformations of hyperrectangles, which can always be transformed back to an axis-aligned representation. All hyperrectangles are zonotopes, and all zonotopes are polytopes; however, the reverse does not hold. (Figure 8.11. Zonotopes are a subclass of polytopes, and hyperrectangles are a subclass of zonotopes.) Hyperrectangles can be compactly represented as a center point and a vector of half-widths. They can also be represented as a set of intervals, one for each dimension. Unlike zonotopes, however, hyperrectangles are not closed under general linear transformations, so the transformations and Minkowski sums in equation (8.12) do not generally preserve the axis-aligned hyperrectangle form.
As noted in section 8.3.1, the number of candidate vertices for the reachable
sets in algorithm 8.3 grows exponentially with the time horizon and causes
computational challenges for high-dimensional systems. There are multiple ways
to reduce this computational burden. One way is to represent the initial state and
disturbance sets using zonotopes since the number of generators scales linearly
with the time horizon (see section 8.3.2). In this section, we will discuss another
technique to reduce the computational cost that relies on overapproximation.
The set $\tilde{\mathcal{P}}$ represents an overapproximation of the set $\mathcal{P}$ if $\mathcal{P} \subseteq \tilde{\mathcal{P}}$. Typically, we select the overapproximated set $\tilde{\mathcal{P}}$ such that it is easier to compute or represent. For example, we can use overapproximation to reduce the computational cost of algorithm 8.3 by overapproximating the reachable set at each iteration with a set that has fewer vertices. (Figure 8.13. Overapproximating the blue polytope with the purple polytope; the purple polytope has fewer vertices.) We can then use this overapproximated set as the initial set for the next iteration.

As long as the overapproximated reachable set does not intersect with the avoid set, we can still use it to make claims about the safety of the system. However, if the overapproximated reachable set does intersect with the avoid set, the results are inconclusive. The violation could be due to unsafe behavior or the overapproximation itself. In this case, we could move to a tighter overapproximation or use a different method to verify safety (example 8.4).
Algorithm 8.5 modifies algorithm 8.3 to include overapproximation. Depend-
ing on the complexity of the reachable sets, we may not need to overapproximate
at every iteration, so we set a frequency parameter to control how often we overap-
proximate. Figure 8.14 demonstrates this idea on the mass-spring-damper system.
A more frequent overapproximation will result in greater computational efficiency
at the cost of extra overapproximation error in the reachable sets. We define over-
approximation error as the difference in volume between the overapproximated
reachable set and the true reachable set. 13
The Hausdorff distance is named
The overapproximation tolerance e places a bound on the Hausdorff distance after German mathematician Felix
between the overapproximated set and the original set.13 The Hausdorff distance Hausdorff (1868–1942).
The Hausdorff distance between two sets $\mathcal{P}$ and $\tilde{\mathcal{P}}$ is the maximum distance from a point in $\mathcal{P}$ to the nearest point in $\tilde{\mathcal{P}}$. A lower value for ε results in a less conservative overapproximation but may require more computation and result in a more complex representation. The rest of this section discusses a technique for computing this overapproximation.
Example 8.4. The effect of overapproximation on accuracy and computational cost for the mass-spring-damper system. If the tolerance is too high, the reachable set may overlap with the avoid set, and the analysis is inconclusive.

Suppose we want to determine if the mass-spring-damper system could reach the avoid set within 40 time steps. To reduce computational cost, we use algorithm 8.5 with an overapproximation frequency of 5 time steps. The plots below show the reachable set $\mathcal{R}_{1:40}$ using three different overapproximation tolerances ε (ε = 0, ε = 0.001, and ε = 1), along with the number of vertices in $\mathcal{R}_d$ for each depth $d$.

The first tolerance of ε = 0 results in no overapproximation, but the number of vertices grows quickly. The highest tolerance of ε = 1 results in significantly fewer vertices, but it is too conservative, to the point where the reachable set overlaps with the avoid set. Therefore, the results of the analysis with ε = 1 are inconclusive. The middle tolerance of ε = 0.001 strikes a balance between the two extremes. With this tolerance, we are still able to verify safety while reducing the computational cost.
3. Compute the distance between each facet of the inner approximation and the
nearest vertex of the outer approximation.
4. Add the direction of the face that is furthest from the nearest vertex to D and
return to step 1.
The process is repeated until the maximum distance between the inner and outer
approximations is less than a specified tolerance e. Figure 8.18 shows the steps
involved in a single iteration of the algorithm, and figure 8.19 demonstrates the
process over multiple iterations.
and constraints are all linear. The linear program for equation (8.26) is

$$\begin{aligned} \underset{s_{1:d},\, \mathbf{x}_{1:d}}{\text{maximize}} \quad & \mathbf{d}^\top s_d \\ \text{subject to} \quad & s_1 \in \mathcal{S} \\ & x_t \in \mathcal{X}_t \quad \text{for all } t \in 1:d \\ & s_{t+1} = \text{Step}(s_t, x_t) \quad \text{for all } t \in 1:d-1 \end{aligned} \qquad (8.27)$$

where

$$\text{Step}(s, x) = (T_s + T_a P_o O_s)s + T_a P_o x_o + T_a x_a + x_s \qquad (8.28)$$

The decision variables in equation (8.27) are the state and disturbances at each time step. The constraints enforce that the state and disturbances are within their respective sets and that the state evolves according to equation (8.9). The optimization problem in equation (8.27) can be solved efficiently using a variety of algorithms.¹⁷ (¹⁷ Modern linear programming solvers can solve problems with thousands of variables and constraints. H. Karloff, Linear Programming. Springer, 2008.)

For the optimization problem in equation (8.27) to be a linear program, the sets $\mathcal{S}$ and $\mathcal{X}_t$ must be polytopes. We can write them as a set of linear inequalities using their $\mathcal{H}$-polytope representations. Algorithm 8.6 implements the linear program for computing the support function of a reachable set at a particular depth $d$. Given a desired time horizon $h$ and a set of directions $\mathcal{D}$, we can compute an overapproximation of $\mathcal{R}_{1:h}$ by evaluating the support function at each direction for each depth. Algorithm 8.7 implements this process.
Similar to the polytope overapproximation in section 8.4, the choice of the directions in $\mathcal{D}$ affects the tightness of the reachable set overapproximation. We could select the directions to align with the axes or use more sophisticated methods like the iterative refinement algorithm in section 8.4.2. Since linear program solvers are computationally efficient, another option is to simply evaluate the support function at many randomly sampled directions. We could also select the directions using trajectory samples. Given a set of samples from the reachable set, we can use principal component analysis (PCA)¹⁸ to determine the directions that best capture the shape of the set.¹⁹ (¹⁸ H. Abdi and L. J. Williams, "Principal Component Analysis," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010. ¹⁹ O. Stursberg and B. H. Krogh, "Efficient Representation and Computation of Reachable Sets for Hybrid Systems," in Hybrid Systems: Computation and Control, 2003.)

The overapproximate reachable sets improve our understanding of the behavior of the system. However, if our ultimate goal is to check intersection with a convex avoid set $\mathcal{U}$, we can solve the problem exactly without the need for
function ρ(model, 𝐝, d)
    # The JuMP model contains state variables 𝐬 for each time step,
    # constrained according to equation (8.27).
    𝐬 = model.obj_dict[:𝐬]
    # Maximize 𝐝ᵀ s_d to evaluate the support function at direction 𝐝.
    @objective(model, Max, 𝐝' * 𝐬[:, d])
    optimize!(model)
    return objective_value(model)
end
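For context, the following is a minimal sketch of how such a model might be constructed with JuMP; the horizon, bounds, and dynamics matrix are hypothetical stand-ins for the quantities in equation (8.27):

using JuMP, HiGHS

# Hypothetical problem data standing in for equation (8.27).
h, n = 10, 2
A = [1.0 0.1; -0.1 0.9]                       # stand-in for Step(s, x)
model = Model(HiGHS.Optimizer)
@variable(model, 𝐬[1:n, 1:h])                 # states s₁:ₕ
@variable(model, 𝐱[1:n, 1:h])                 # disturbances x₁:ₕ
@constraint(model, -0.2 .<= 𝐬[:, 1] .<= 0.2)  # s₁ ∈ S
@constraint(model, -0.01 .<= 𝐱 .<= 0.01)      # xₜ ∈ X
@constraint(model, [t = 1:h-1], 𝐬[:, t+1] .== A * 𝐬[:, t] + 𝐱[:, t])
ρ(model, [1.0, 0.0], h)                       # support function at depth h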
$$\begin{aligned} \underset{s_{1:d},\, \mathbf{x}_{1:d},\, u}{\text{minimize}} \quad & \| s_d - u \| \\ \text{subject to} \quad & u \in \mathcal{U} \\ & s_1 \in \mathcal{S} \end{aligned} \qquad (8.29)$$
8.6 Summary
• We can compute reachable sets for linear systems by propagating sets through
the system dynamics.
Example 8.5. Checking whether the mass-spring-damper system can reach the avoid set using convex programming.

The avoid set for the mass-spring-damper system can be written as the union of two convex sets. Specifically, we require that $|p| < 0.3$. The first set is therefore represented by the linear inequality $[1, 0]^\top s \ge 0.3$, and the second set is represented by the linear inequality $[-1, 0]^\top s \ge 0.3$. To check whether the system could reach the avoid set, we run algorithm 8.8 for each component of the avoid set. The system does not satisfy the specification if the algorithm returns false for either component.
• If the number of vertices in the reachable set grows too large, we can produce
overapproximate representations by evaluating the support function on a set
of directions.
9 Reachability for Nonlinear Systems
This chapter extends the set propagation and optimization techniques discussed
in chapter 8 to perform reachability on nonlinear systems. A system is nonlin-
ear if its agent, environment, or sensor model contains nonlinear functions. The
reachable sets of nonlinear systems are often nonconvex and difficult to compute
exactly. This chapter begins by discussing several set propagation techniques for nonlinear systems that overapproximate the reachable set.¹ (¹ For more details on set propagation through nonlinear systems, refer to M. Althoff, G. Frehse, and A. Girard, "Set Propagation Techniques for Reachability Analysis," Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 369–395, 2021.) We then discuss optimization-based nonlinear reachability methods. To minimize the overapproximation error introduced by these methods, we introduce a technique for overapproximation error reduction that involves partitioning the state space. We conclude by discussing reachability techniques for nonlinear systems represented by a neural network.
For nonlinear systems, the reachability function $r(s, \mathbf{x}_{1:d})$ is a nonlinear function. In contrast with the linear systems in chapter 8, we cannot directly propagate arbitrary polytopes through nonlinear systems. We can, however, propagate hyperrectangular sets² using a technique called interval arithmetic.³ (² We can also propagate sets that are linear transformations of hyperrectangles by reversing the linear transformation to obtain an axis-aligned hyperrectangle and performing the analysis in the transformed space. ³ L. Jaulin, M. Kieffer, O. Didrit, and É. Walter, Interval Analysis. Springer, 2001.) Interval arithmetic extends traditional arithmetic operations and other elementary functions to intervals. An interval is a set of real numbers written as

$$[x] = [\underline{x}, \overline{x}] = \{ x \mid \underline{x} \le x \le \overline{x} \} \qquad (9.1)$$

where $\underline{x}$ and $\overline{x}$ are the lower and upper bounds of the interval, respectively. A hyperrectangle, also known as an interval box, is the Cartesian product of a set of $n$ intervals:

$$[\mathbf{x}] = [x_1] \times [x_2] \times \cdots \times [x_n] \qquad (9.2)$$
(Figure 9.1. A two-dimensional interval box $[\mathbf{x}] = [x_1] \times [x_2]$.)

A binary operation $\circ$ applied to two intervals is defined as

$$[x] \circ [y] = \{ x \circ y \mid x \in [x],\; y \in [y] \} \qquad (9.3)$$

The interval counterpart of a function $f$ is defined as

$$f([x]) = [\{ f(x) \mid x \in [x] \}] \qquad (9.8)$$

where the $[\cdot]$ operation takes the interval hull of the resulting set. The interval hull of a set is the smallest interval that contains the set. Therefore, the interval counterpart of a function returns the smallest interval that contains all possible function evaluations of the points in the input interval.
We can define an interval counterpart for a variety of elementary functions.⁴ (⁴ IntervalArithmetic.jl defines the interval counterpart of many elementary functions such as sin, cos, exp, and log in Julia.) For monotonically increasing functions such as exp, log, and square root, the interval counterpart is

$$f([x]) = [f(\underline{x}), f(\overline{x})] \qquad (9.9)$$

The interval counterpart for monotonically decreasing functions is similarly defined. Nonmonotonic elementary functions such as sin, cos, and square require multiple cases to define their interval counterparts. For example, the interval counterpart for the square function is

$$[x]^2 = \begin{cases} [\min(\underline{x}^2, \overline{x}^2),\, \max(\underline{x}^2, \overline{x}^2)] & \text{if } 0 \notin [x] \\ [0,\, \max(\underline{x}^2, \overline{x}^2)] & \text{otherwise} \end{cases} \qquad (9.10)$$
Figure 9.2 shows example evaluations of the interval counterparts for the exp,
square, and sin functions.
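These evaluations can be reproduced with a minimal sketch using the IntervalArithmetic.jl package mentioned in the margin; the chosen input interval is arbitrary:

using IntervalArithmetic

# Interval counterparts of elementary functions (arbitrary example interval).
x = interval(-1.0, 2.0)
exp(x)   # [exp(-1), exp(2)], since exp is monotonically increasing
x^2      # [0, 4], since 0 ∈ [x] (equation 9.10)
sin(x)   # smallest interval containing {sin(x) | x ∈ [-1, 2]}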
For complex functions, it is not always possible to define a tight interval counterpart. In these cases, we instead define an inclusion function. An inclusion function $[f]([x])$ outputs an interval that is guaranteed to contain the interval from the interval counterpart:

$$f([x]) \subseteq [f]([x]) \qquad (9.11)$$

In other words, inclusion functions output overapproximate intervals. We can also define an inclusion function for multivariate functions that map from $\mathbb{R}^k$ to $\mathbb{R}$ where $k \ge 1$.
For reachability analysis, our goal is to propagate intervals through the function $r(s, \mathbf{x}_{1:d})$, which maps its inputs to $\mathbb{R}^n$ where $n$ is the dimension of the state space. We can rewrite $r(s, \mathbf{x}_{1:d})$ as a vector of functions that map to $\mathbb{R}$ as follows:

$$s' = r(s, \mathbf{x}_{1:d}) = \begin{bmatrix} r_1(s, \mathbf{x}_{1:d}) \\ \vdots \\ r_n(s, \mathbf{x}_{1:d}) \end{bmatrix} \qquad (9.12)$$

where $r_i(s, \mathbf{x}_{1:d})$ outputs the value of the $i$th component of $s'$. We can then define the inclusion function for each $r_i(s, \mathbf{x}_{1:d})$ as $[r_i]([s], [\mathbf{x}_{1:d}])$. By evaluating each inclusion function for the input intervals $[s]$ and $[\mathbf{x}_{1:d}]$, we obtain an overapproximate hyperrectangular reachable set. The rest of this section discusses techniques to create these inclusion functions.
9.2.1 Natural Inclusion Functions

An inclusion function constructed by replacing each elementary operation in a function with its interval counterpart is known as a natural inclusion function. For example, the natural inclusion function for $f(x) = x \sin(x)$ is $[f]([x]) = [x]\sin([x])$ (figure 9.3). (Figure 9.3. Example evaluation of the natural inclusion function for $f(x) = x \sin(x)$. The inclusion function produces an overapproximate interval.)

By replacing the elementary nonlinear components of the agent, environment, and sensor models with their interval counterparts, we can create the natural inclusion function for $r_i(s, \mathbf{x}_{1:d})$. We can then use interval arithmetic to propagate hyperrectangular sets through the natural inclusion function. This computation will result in overapproximate reachable sets for nonlinear systems. Algorithm 9.1 implements the natural inclusion reachability algorithm and computes overapproximate reachable sets up to a desired time horizon. Example 9.1 applies algorithm 9.1 to the inverted pendulum problem.
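As a small illustration, the natural inclusion function for $f(x) = x \sin(x)$ can be evaluated directly with IntervalArithmetic.jl; this is a sketch of the idea, not the book's algorithm 9.1:

using IntervalArithmetic

# Natural inclusion function for f(x) = x sin(x): each elementary
# operation is replaced by its interval counterpart.
f(x) = x * sin(x)
x = interval(-1.0, 1.0)
f(x)   # an overapproximate interval containing {x sin(x) | x ∈ [-1, 1]}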
As shown in figure 9.3 and example 9.1, natural inclusion functions tend to
be overly conservative. This property is due to the dependency effect, in which
multiple occurrences of the same variable are treated independently (see ex-
ample 8.2). In chapter 8, we were able to eliminate this effect by simplifying
equations to algebraically combine all repeated instances of a variable. However,
this simplification is not always possible for nonlinear functions such as the one
shown in figure 9.3. We can instead mitigate the dependency effect by using more sophisticated techniques for generating inclusion functions, which we discuss in the remainder of this section.
9.2.2 Mean Value Inclusion Functions

The mean value theorem states that for a function that is continuous and differentiable over an interval $[x]$, there exists some $x' \in [x]$ such that $f'(x') = (f(\overline{x}) - f(\underline{x}))/(\overline{x} - \underline{x})$. In other words, there exists a point in $[x]$ where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the interval (figure 9.4).
Example 9.1. Computing reachable sets for the inverted pendulum system using its natural inclusion function. The plot shows the overapproximated reachable set $\mathcal{R}_2$ computed using algorithm 9.1 compared to a set of samples from $\mathcal{R}_2$.

Suppose we want to compute reachable sets for the pendulum system with bounded sensor noise on the angle and angular velocity using algorithm 9.1. We define the intervals and extract functions as follows:

function intervals(sys, d)
    disturbance_mag = 0.01
    θmin, θmax = -π/16, π/16
    ωmin, ωmax = -1.0, 1.0
    𝐈 = [interval(θmin, θmax), interval(ωmin, ωmax)]
    for i in 1:2d
        push!(𝐈, interval(-disturbance_mag, disturbance_mag))
    end
    return 𝐈
end

function extract(env::InvertedPendulum, x)
    s = x[1:2]
    𝐱 = [Disturbance(0, 0, x[i:i+1]) for i in 3:2:length(x)]
    return s, 𝐱
end

The intervals function returns the initial state intervals followed by the disturbance intervals for each time step. The extract function extracts these intervals into the state and disturbance components. The plot in the caption shows the overapproximated reachable set after two time steps.
The mean value theorem implies that for any subinterval of $[x]$, there exists a point in $[x]$ where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the subinterval. Therefore, given the center $c$ of the interval $[x]$, there exists a point $x' \in [x]$ such that

$$\frac{f(x) - f(c)}{x - c} = f'(x') \qquad (9.14)$$

for any $x \in [x]$ (figure 9.5). (Figure 9.5. For a given subinterval $[c, x]$, there exists a point in $[x]$ where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the subinterval.) Rearranging equation (9.14) gives

$$f(x) = f(c) + f'(x')(x - c) \qquad (9.15)$$

Because we know that $x' \in [x]$, we can use equation (9.15) to create an inclusion function for $f(x)$ as follows:

$$[f]([x]) = f(c) + [f']([x])([x] - c) \qquad (9.16)$$

where $[f']$ is the natural inclusion function for $f'(x)$. For multivariate functions, equation (9.16) generalizes by expanding about the center of the input box and using an inclusion function for the gradient. As we widen the input interval to include nonlinear regions, the mean value inclusion function becomes more conservative. (Figure 9.7. Mean value inclusion function for $f(x) = x \sin(x)$ over a wider interval.)
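The following is a minimal sketch of a mean value inclusion function for $f(x) = x \sin(x)$, using IntervalArithmetic.jl and a hand-coded derivative; the function names are illustrative:

using IntervalArithmetic

# Mean value inclusion function (equation 9.16) for f(x) = x sin(x).
# The derivative f′ is evaluated with its natural inclusion function.
f(x)  = x * sin(x)
f′(x) = sin(x) + x * cos(x)
function mean_value_inclusion(f, f′, x::Interval)
    c = mid(x)                     # center of the interval
    return f(c) + f′(x) * (x - c)  # f(c) + [f′]([x])([x] − c)
end
mean_value_inclusion(f, f′, interval(-0.5, 0.5))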
9.2.3 Taylor Inclusion Functions

Natural inclusion functions and mean value inclusion functions are special cases of a more general type of inclusion function known as a Taylor inclusion function. These inclusion functions use Taylor series expansions about the center of the
(Figure 9.8. Evaluation of Taylor inclusion functions of different orders (natural inclusion, first order, second order, third order) for $f(x) = x \sin(x)$ over the interval $[x] = [-1, 1]$ (top row) and $[x] = [-1.5, 1.5]$ (bottom row).)

input interval, with higher-order expansions yielding tighter overapproximations (figure 9.8). However, the benefit of using higher-order terms depends on the behavior of the function over the input interval. If the function is nearly linear over the input interval, moving beyond a first-order model may not be
worth the additional computational cost. In contrast, if the function is highly nonlinear over the input interval, a higher-order model may significantly decrease overapproximation error.

Algorithm 9.2 implements first- and second-order Taylor inclusion functions for reachability analysis. The algorithm computes overapproximate reachable sets up to a desired time horizon by evaluating the Taylor inclusion function for each subfunction $r_i(s, \mathbf{x}_{1:d})$. Taylor inclusion functions can be used to create tighter overapproximations of the reachable set than natural inclusion functions, especially for short time horizons (figure 9.9).⁷ (⁷ Because Taylor inclusion functions can only be applied to functions that are continuous and differentiable, we use a modified version of the pendulum problem in this chapter that does not apply clamping in the environment model.) However, the nonlinearities compound for each time step, so Taylor models can be computationally expensive and result in significant overapproximation error for long time horizons (example 9.2).
(Figure 9.9. Comparison of the one-step overapproximated reachable sets for the inverted pendulum system using natural, first-order Taylor, and second-order Taylor inclusion functions.)
Example 9.2. Overapproximate reachable sets for the inverted pendulum system using first-order Taylor inclusion functions at different depths. As the depth increases, the overapproximation error increases.

The plots below show the overapproximate reachable sets for the inverted pendulum system produced by a first-order Taylor inclusion function at different depths ($\mathcal{R}_2$ through $\mathcal{R}_5$). As the depth increases, the overapproximation error increases. This result is due to the increasing presence of nonlinearities in the system dynamics as we increase the depth. For the one-step reachable set ($\mathcal{R}_2$), the only nonlinearity present is the sine function in the pendulum dynamics. As the depth increases, this nonlinearity will be repeated for each time step, leading to larger overapproximation error.
While inclusion functions only operate over interval inputs and output reachable sets in the form of hyperrectangles, Taylor models operate over other types of input sets and are able to represent more expressive reachable sets.⁸ (⁸ K. Makino and M. Berz, "Taylor Models and Other Validated Functional Inclusion Methods," International Journal of Pure and Applied Mathematics, vol. 4, no. 4, pp. 379–456, 2003.) Similar to Taylor inclusion functions, Taylor models are based on Taylor series expansions. An $n$th-order Taylor model is a set represented as

$$\mathcal{T} = \{ p(\mathbf{x}) + \alpha \mid \mathbf{x} \in \mathcal{X},\; \alpha \in [\alpha] \} \qquad (9.20)$$

where $p(\mathbf{x})$ is a Taylor polynomial and $[\alpha]$ is an interval remainder term. For a scalar function $f$, the Taylor polynomial is

$$p(x) = f(c) + f'(c)(x - c) + \frac{f''(c)}{2!}(x - c)^2 + \cdots + \frac{f^{(n-1)}(c)}{(n-1)!}(x - c)^{n-1} \qquad (9.21)$$

where $c$ is the center of the input interval. The interval remainder term, also known as the Lagrange remainder, bounds the sum of the rest of the terms in the Taylor expansion over the input interval $[x]$ so that the Taylor model is guaranteed to contain the true output of the function. It is calculated as

$$[\alpha] = \frac{[f^{(n)}]([x])}{n!}([x] - c)^n \qquad (9.22)$$

and is equivalent to the last term in a Taylor inclusion function of order $n$. In fact, passing an interval through a Taylor model performs the same computation as a Taylor inclusion function of the same order.

As the order of a Taylor model increases, overapproximation error tends to decrease (figure 9.10). Producing a zero-order Taylor model is equivalent to evaluating the natural inclusion function, while producing a first-order Taylor model is equivalent to evaluating the mean value inclusion function. Taylor models begin to deviate from inclusion functions for orders of two or higher. Second-order Taylor models represent arbitrary polytopes, while second-order inclusion functions only produce hyperrectangles. Higher-order Taylor models correspond to nonconvex sets, which are more difficult to understand and manipulate.⁹ (⁹ One way to handle this nonconvexity is to represent sets using an extension of zonotopes called polynomial zonotopes. More details can be found in M. Althoff, "Reachability Analysis of Nonlinear Systems Using Conservative Polynomialization and Non-Convex Sets," in International Conference on Hybrid Systems: Computation and Control, 2013. Another representation called star sets can also be used to represent nonconvex sets and has been used for reachability. H.-D. Tran, D. Manzanas Lopez, P. Musau, X. Yang, L. V. Nguyen, W. Xiang, and T. T. Johnson, "Star-Based Reachability Analysis of Deep Neural Networks," in International Symposium on Formal Methods, 2019.) For this reason, we focus the remainder of this section on second-order Taylor models.

Creating a second-order Taylor model for a function $f(\mathbf{x})$ is a process known as conservative linearization.¹⁰ (¹⁰ M. Althoff, O. Stursberg, and M. Buss, "Reachability Analysis of Nonlinear Systems with Uncertain Parameters Using Conservative Linearization," in IEEE Conference on Decision and Control (CDC), 2008.) Given an input set $\mathcal{X}$ and a center point $\mathbf{c}$, the second-order Taylor model is
Zero Order First Order Second Order Third Order Figure 9.10. Taylor models of dif-
ferent orders for f ( x ) = x sin( x )
1 over the interval [ x ] = [ 1.5, 0.0].
The dashed purple lines show re-
0 sults from a Taylor inclusion func-
f (x)
2
2 1 0 2 1 0 2 1 0 2 1 0
where J is the Jacobian of f evaluated at c and [↵] is the interval remainder term.
The Jacobian is a generalization of the gradient to functions with multidimensional
outputs and is computed as
2 3
r f 1 (c) >
6 .. 7
J=6 4 .
7
5 (9.24)
r f n (c) >
where r f i (c) is the gradient of the ith component of f evaluated at c. The interval
remainder term is calculated using interval arithmetic as
1
[↵] = ([X ] c)> [r2 f ]([X ])([X ] c) (9.25)
2
where [X] is the interval hull of X.¹¹

Equation (9.23) represents a linear approximation of the nonlinear function f with a remainder term that bounds the error of the approximation. Because all of the operations in equation (9.23) are linear, we can use it to propagate convex sets. In other words, if X is convex, we can rewrite the Taylor model in terms of linear transformations and Minkowski sums as

    T = f(c) + J(X − c) ⊕ [α]    (9.26)

11. If the input set X is represented as a zonotope, it is also possible to overapproximate the remainder term directly without taking the interval hull. This approach can reduce overapproximation error. M. Althoff, O. Stursberg, and M. Buss, "Reachability Analysis of Nonlinear Systems with Uncertain Parameters Using Conservative Linearization," in IEEE Conference on Decision and Control (CDC), 2008.
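To make this concrete, the following is a minimal sketch of conservative linearization for a scalar function using the IntervalArithmetic.jl package. The function name conservative_linearization_1d and the hand-coded derivatives are illustrative assumptions rather than the book's implementation; the remainder term follows equation (9.25).

    using IntervalArithmetic

    f(x)   = x - sin(x)   # example function
    df(x)  = 1 - cos(x)   # first derivative, derived by hand
    d2f(x) = sin(x)       # second derivative, derived by hand

    function conservative_linearization_1d(f, df, d2f, X)
        c = mid(X)                          # center of the input interval
        α = d2f(X) * (X - c)^2 / 2          # interval remainder (eq. 9.25)
        return f(c) + df(c) * (X - c) + α   # linear model plus remainder
    end

    T = conservative_linearization_1d(f, df, d2f, interval(-1.5, 0.0))

Evaluating d2f on the whole input interval bounds the second-order error, so the returned interval is guaranteed to contain the true range of f over X.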
Example 9.3. Computing the one-step reachable set for the inverted pendulum system using conservative linearization. Conservative linearization better approximates the reachable set than a second-order Taylor inclusion function.

Suppose we want to compute reachable sets for the pendulum system with bounded sensor noise on the angle and angular velocity using algorithm 9.3. We define the sets function as follows:

    function sets(sys, d)
        disturbance_mag = 0.01
        θmin, θmax = -π/16, π/16
        ωmin, ωmax = -1.0, 1.0
        low = [θmin, ωmin]
        high = [θmax, ωmax]
        for i in 1:d
            append!(low, [-disturbance_mag, -disturbance_mag])
            append!(high, [disturbance_mag, disturbance_mag])
        end
        return Hyperrectangle(low=low, high=high)
    end
The sets function returns the initial state set followed by the disturbance sets
for each time step. The plots below compare the one-step reachable set pro-
duced by conservative linearization with the set produced by a second-order
Taylor inclusion function. While conservative linearization still produces an
overapproximation, it captures the shape of the true reachable set better than
a Taylor inclusion function.
[plots: one-step reachable sets in the θ (rad), ω (rad/s) plane for conservative linearization (left) and a second-order Taylor inclusion function (right)]
9.4 Concrete Reachability

[Figure 9.11. Symbolic versus concrete reachability for the inverted pendulum system. The symbolic reachability algorithm directly computes R₃ without explicitly computing R₂ by considering r(s₁, x₁:₃) as a single function. The concrete reachability algorithm computes R₂ and R₃ separately by considering r(s₁, x₁:₂) and r(s₂, x₂:₃) as separate functions.]
The growing input dimension causes the size of the gradient and Hessian to increase, leading to more expensive computations. Furthermore, the nonlinearities in the agent, environment, and sensor models compound over time, causing the accuracy of a linearized model to degrade as the depth increases.
Concrete reachability algorithms address these issues by decomposing the reach-
ability function into a sequence of simpler functions. Instead of overapproximating
the reachable set over the entire depth at once, they compute the overapproximate
reachable set for each time step individually. At each iteration, they use the over-
approximate reachable set from the previous time step as the input set for the next
time step. We refer to this process as concrete reachability because we concretize
the reachable set at each time step by explicitly computing an overapproximate
representation. In contrast, the algorithms presented thus far maintain a sym-
bolic representation of the reachable set at each time step and only concretize the
reachable set at depth d. For this reason, we refer to these algorithms as symbolic
reachability algorithms. Figure 9.11 illustrates the difference between symbolic and
concrete reachability algorithms.
Algorithms 9.4 and 9.5 implement concrete versions of the symbolic reach-
ability algorithms presented in algorithms 9.2 and 9.3, respectively. For each
depth in the time horizon, they compute the overapproximate reachable set for
the next step using the overapproximate reachable set from the previous step.
Algorithm 9.4 concretizes the reachable set into a hyperrectangle at each time
step, while algorithm 9.5 concretizes the reachable set into a polytope at each
time step.
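The listing below is a minimal sketch of this loop in the spirit of algorithm 9.4, assuming a hypothetical one_step function that overapproximates the single-step reachable set; LazySets.box_approximation concretizes each intermediate set into a hyperrectangle.

    using LazySets

    function concrete_reachable_sets(one_step, S₁, d)
        Rs = [S₁]
        for t in 1:d-1
            R = one_step(Rs[end])             # overapproximate one-step set
            push!(Rs, box_approximation(R))   # concretize to a hyperrectangle
        end
        return Rs
    end

Swapping box_approximation for a tighter polytopic concretization would give the flavor of algorithm 9.5 instead.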
Concrete reachability algorithms are generally more computationally efficient
than symbolic reachability algorithms. However, it is not always clear whether
they will produce tighter overapproximations because there are multiple factors
that contribute to the overapproximation error. The only source of overapproxima-
tion error in symbolic reachability algorithms is the error introduced by linearizing
the reachability function and bounding the remainder term. We expect this lin-
earization error to be smaller for concrete reachability algorithms because they
linearize over a single time step rather than the entire time horizon.
While concrete reachability algorithms reduce overapproximation error due to
linearization, they introduce additional overapproximation error by concretizing
the reachable set at each time step into an overapproximate reachable set (fig-
ure 9.11). This error compounds over time, and the accumulation of this error is
often referred to as the wrapping effect.
The decrease in linearization error and introduction of the wrapping effect
for concrete reachability algorithms result in a tradeoff between concrete and
symbolic reachability (figures 9.12 and 9.13). The choice of which type of al-
gorithm to use depends on the specific system, the reachability algorithm, and
the desired tradeoff between computational efficiency and overapproximation
error. For example, if we are using linearized models for reachability and the
one-step reachability function is nearly linear, concrete reachability algorithms
may produce tighter overapproximations than symbolic reachability algorithms.
It is common to mix concrete and symbolic reachability algorithms to take advan-
tage of the strengths of each approach. For example, instead of concretizing the
reachable set at each time step, we can concretize the reachable set every k time
steps to reduce the wrapping effect.
Another benefit of using concrete reachability algorithms is that we can use
them to check for invariant sets. Similar to the check for invariance described for
the set propagation techniques in section 8.2, if we find that the reachable set at
a given time step is contained within the concrete reachable set at the previous
time step, we can conclude that the reachable set is invariant. For example, the
concrete versions of R6 in figures 9.12 and 9.13 are contained within the concrete
versions of R5 . Therefore, we can conclude that R6 is an invariant set in both
cases, meaning that the system will remain within the set for all future time steps.
9.5 Optimization-Based Nonlinear Reachability
Similar to the ideas in section 8.5, we can overapproximate the reachable set of
nonlinear systems by sampling the support function. For symbolic reachability,
    minimize_{s_d, x_{1:d}}   dᵀs_d
    subject to   s₁ ∈ S
                 x_t ∈ X_t for all t ∈ 1:d
                 s_d = r(s₁, x_{1:d})                                  (9.27)

For concrete reachability, we replace the last constraint with a constraint for each time step as follows:

    minimize_{s_d, x_{1:d}, α}   dᵀs_d
    subject to   s₁ ∈ S
                 x_t ∈ X_t for all t ∈ 1:d
                 s_d = r(s_c, x_c) + J [s₁ − s_c ; x_{1:d} − x_c] + α
                 α ∈ [α]                                               (9.29)

where s_c and x_c are the centers of the state and disturbance sets and J is the Jacobian of the reachability function evaluated at s_c and x_c. We introduce another decision variable α to represent the remainder term and constrain it to belong to the Lagrange remainder interval [α] (equation (9.25)).
Encoding a piecewise linear reachability function as a set of mixed-integer constraints turns equation (9.27) into a MILP, which we can solve using a variety of algorithms.¹⁴

14. A detailed overview of integer programming can be found in L. A. Wolsey, Integer Programming. Wiley, 2020. Modern solvers, such as Gurobi and CPLEX, can routinely handle problems with millions of variables. There are packages for Julia that provide access to Gurobi, CPLEX, and a variety of other solvers.

While many real-world nonlinear systems do not have piecewise linear reachability functions, we can overapproximate them with piecewise linear bounds. First, we decompose the reachability function into a conjunction of elementary nonlinear functions (see example 9.6).
Example 9.4. Writing the ReLU function in terms of the max function.

Consider the following piecewise linear function:

    f(x) = 0 if x < 0, and f(x) = x otherwise

This function is often referred to as the rectified linear unit (ReLU) function and is commonly used in neural networks. We can rewrite this function in terms of the max function as follows:

    f(x) = max(0, x)

If x < 0, the max function will return 0, and if x ≥ 0, the max function will return x.
For each nonlinear elementary function, we can derive piecewise linear lower and upper bounds over a given interval. We can then convert those bounds to mixed-integer constraints and solve the resulting MILP to overapproximate the reachable set.¹⁵

15. For more details on the process of deriving the bounds and converting to constraints, see C. Sidrane, A. Maleki, A. Irfan, and M. J. Kochenderfer, "OVERT: An Algorithm for Safety Verification of Neural Network Control Policies for Nonlinear Systems," Journal of Machine Learning Research, vol. 23, no. 117, pp. 1–45, 2022.

9.6 Partitioning

The methods presented in this chapter tend to result in less overapproximation error when computing reachable sets over smaller regions of the input space. For example, Taylor approximations are more accurate for points near the center of the region and become less accurate as we move away from the center (figure 9.15). Therefore, we want to keep the input set for Taylor inclusion functions and Taylor models as small as possible to minimize overapproximation error.

[Figure 9.15. First-order Taylor approximation (dashed blue line) for the function f(x) = x − sin(x) (gray) centered at x = 0. The approximation is more accurate near the center.]

We can improve the accuracy of these techniques by partitioning the input set into smaller regions and computing the reachable set for each region separately. Specifically, we divide the input set S into a set of smaller regions S⁽¹⁾, S⁽²⁾, …, S⁽ᵐ⁾ such that

    S = ⋃ᵢ₌₁ᵐ S⁽ⁱ⁾    (9.30)

To compute the reachable set at depth d, we compute the reachable set R_d⁽ⁱ⁾ for each region S⁽ⁱ⁾ separately and then combine the results to form the reachable set, as sketched below.
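The following is a minimal sketch of this idea, assuming a hypothetical reachable function that overapproximates the reachable set of a single region; LazySets.split partitions a hyperrectangle into a uniform grid of subregions.

    using LazySets

    function partitioned_reachable_set(reachable, S::Hyperrectangle, blocks)
        regions = split(S, blocks)   # e.g., blocks = [2, 2] gives four regions
        return UnionSetArray([reachable(Sᵢ) for Sᵢ in regions])
    end

Each subregion is smaller than S, so the per-region overapproximation is tighter, at the cost of more calls to the reachability routine.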
Example 9.5. Mixed-integer formulation of the ReLU function.

Suppose we want to solve an optimization problem with the following piecewise linear constraint:

    y = max(0, x)

We will also assume that we know that x lies in the interval [x̲, x̄]. We can encode this constraint using a set of mixed-integer constraints as follows:

    y ≤ x − x̲(1 − a)
    y ≥ x
    y ≤ x̄a
    y ≥ 0
    a ∈ {0, 1}

The plots below iteratively build up the constrained region for each possible value of a: a = 0 forces y = 0 (with x ≤ 0), and a = 1 forces y = x (with x ≥ 0).

[plots: the constraint regions built up for a = 0 and a = 1]
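As a concrete illustration, the following sketch encodes these constraints with JuMP and the HiGHS solver; the bounds xlo and xhi and the objective are illustrative assumptions.

    using JuMP, HiGHS

    xlo, xhi = -2.0, 3.0                         # assumed bounds on x
    model = Model(HiGHS.Optimizer)
    @variable(model, xlo <= x <= xhi)
    @variable(model, y)
    @variable(model, a, Bin)
    @constraint(model, y <= x - xlo * (1 - a))   # with a = 1, forces y <= x
    @constraint(model, y >= x)
    @constraint(model, y <= xhi * a)             # with a = 0, forces y <= 0
    @constraint(model, y >= 0)
    @objective(model, Max, y)                    # e.g., maximize the ReLU output
    optimize!(model)

Together, a = 1 yields y = x and a = 0 yields y = 0, exactly matching y = max(0, x).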
Example 9.6. Converting a nonlinear equality constraint into a set of mixed-integer constraints using piecewise linear bounds. We use the OVERT.jl package to compute the overapproximations.

Consider the nonlinear constraint y = x − sin(x) over the region −2 ≤ x ≤ 2. We can convert this constraint into a set of piecewise linear constraints by first decomposing the function into its elementary functions:

    y = x − z
    z = sin(x)
    −2 ≤ x ≤ 2

We then derive a piecewise linear lower bound z̲ and upper bound z̄ for sin(x) and rewrite the constraints as

    y = x − z
    z̲ ≤ z ≤ z̄
    z̲ = z̲(x)
    z̄ = z̄(x)
    −2 ≤ x ≤ 2

The plots below show the overapproximations of sin(x) using different numbers of linear segments.

[plots: piecewise linear bounds z̲(x) and z̄(x) on sin(x) over [−2, 2] for increasing numbers of segments]

The final step is to convert the piecewise linear functions z̲(x) and z̄(x) into their corresponding mixed-integer constraints. The overapproximations become tighter as the number of segments increases, but the computational cost and the number of mixed-integer constraints required to represent the piecewise linear bounds also increase.
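A rough sketch of deriving such bounds is shown below: we interpolate the function between breakpoints and shift the interpolant by the maximum deviation, which is estimated here by dense sampling. A rigorous tool such as OVERT.jl bounds this error analytically instead of sampling.

    function piecewise_bounds(f, lo, hi; nsegments=4, nsamples=1000)
        xs = range(lo, hi, length=nsegments + 1)   # breakpoints
        function interp(x)                         # piecewise linear interpolant
            i = min(searchsortedlast(xs, x), nsegments)
            x1, x2 = xs[i], xs[i+1]
            return f(x1) + (f(x2) - f(x1)) * (x - x1) / (x2 - x1)
        end
        grid = range(lo, hi, length=nsamples)
        ε = maximum(abs(f(x) - interp(x)) for x in grid)   # deviation estimate
        return x -> interp(x) - ε, x -> interp(x) + ε      # lower, upper bounds
    end

    zlo, zhi = piecewise_bounds(sin, -2.0, 2.0; nsegments=4)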
[Figure 9.16. Computing the one-step reachable set for the inverted pendulum system using partitioning. Panels: input set, input partition, output partition, output set. The input set S is partitioned into four regions S⁽¹⁾, …, S⁽⁴⁾, and the reachable set for each region is computed separately using a first-order Taylor inclusion function. The union of the resulting output sets R⁽¹⁾, …, R⁽⁴⁾ forms the full reachable set R.]
9.7 Neural Networks
We can use the techniques discussed in the previous sections to verify properties of neural networks. Neural networks are a class of functions that are widely used in machine learning and could be used to represent the agent, environment, or sensor model. They are composed of a series of layers, each of which applies an affine transformation followed by a nonlinear activation function.¹⁷ Given a set of inputs to a neural network, we are often interested in understanding the possible outputs.¹⁸ For example, we may want to ensure that an aircraft collision avoidance system will always output an alert when other aircraft are nearby.

Evaluating a neural network is similar to performing a rollout of a system. However, instead of computing s_{t+1} by passing s_t through the sensor, agent, and environment models, we compute it by passing s_t through the tth layer of the neural network. If s_t is the input to layer t, then the output s_{t+1} is computed as

    s_{t+1} = f(W_t s_t + b_t)    (9.32)

where W_t is a matrix of weights, b_t is a bias vector, and f(·) is a nonlinear activation function. Common activation functions include ReLU, sigmoid, and hyperbolic tangent. Figure 9.18 shows an example of a two-layer neural network. In this context, we can check properties of the neural network by computing the reachable set of the output layer given an input set.

17. More details about the structure and training of neural networks are found in appendix D.

18. This process is sometimes referred to as neural network verification. A detailed overview of neural network verification can be found in C. Liu, T. Arnon, C. Lazarus, C. Strong, C. Barrett, and M. J. Kochenderfer, "Algorithms for Verifying Deep Neural Networks," Foundations and Trends in Optimization, vol. 4, no. 3–4, pp. 244–404, 2021.
For piecewise linear activation functions, we can compute the exact reachable
set by partitioning the input space into different activation sets and computing
[Figure 9.20. Computing the overapproximate reachable set of a two-layer neural network using natural inclusion functions. Panels: input set, layer 1 output, output set. The true reachable set for each layer is shown in blue, and the interval overapproximation is shown in purple.]
the reachable set for each subset separately.¹⁹ For example, we can compute exact reachable sets for neural networks with ReLU activation functions (example 9.7). However, the number of subsets grows exponentially with the number of nodes in the network. Therefore, exact reachability analysis is often intractable for large neural networks, so it is common to instead use overapproximation techniques to bound the output set.

19. W. Xiang, H.-D. Tran, J. A. Rosenfeld, and T. T. Johnson, "Reachable Set Estimation and Safety Verification for Piecewise Linear Systems with Neural Network Controllers," in American Control Conference (ACC), 2018.
Similar to the nonlinear systems discussed earlier, we can use inclusion func-
tions to overapproximate the output set of neural networks. By replacing each
activation function with its interval counterpart, we obtain the natural inclusion
function for a neural network. Figure 9.19 shows an example evaluation of the
interval counterpart for the ReLU function. Evaluating the natural inclusion func-
tion for the network on a set of input intervals provides an overapproximation of
the possible network outputs (figure 9.20).
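The sketch below propagates interval inputs through a small ReLU network using IntervalArithmetic.jl; the weights, biases, and input intervals are hypothetical, and for simplicity the activation is applied after every layer.

    using IntervalArithmetic

    relu(x) = max(x, zero(x))   # interval counterpart of ReLU via interval max

    function interval_forward(Ws, bs, s)
        for (W, b) in zip(Ws, bs)
            s = relu.(W * s + b)   # affine map, then activation, in intervals
        end
        return s
    end

    Ws = [[1.0 -1.0; 0.5 1.0], [1.0 1.0]]    # hypothetical two-layer weights
    bs = [[0.0, -0.5], [0.1]]
    s₁ = [interval(-1, 1), interval(-1, 1)]  # input intervals
    interval_forward(Ws, bs, s₁)             # interval bound on the output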
Example 9.7. Exact reachability for a two-layer neural network with ReLU activation functions.

Suppose we want to propagate the input set S₁ shown below through the first layer of the neural network in figure 9.18. We first apply the linear transformation to obtain the pre-activation region Z₂ = W₁S₁ ⊕ b₁:

[plots: the input set S₁, the pre-activation region Z₂, and its split into activation subsets]

To compute the final output set of the neural network in figure 9.18, we would repeat this process for each of the subsets that comprise S₂. The final output set will therefore be the union of 16 subsets.
where f_n(s₁) is the neural network function and S is the input set. For ReLU networks, it is possible to write this optimization problem as a MILP by converting each ReLU activation function into its corresponding mixed-integer constraints (see example 9.5). To create the mixed-integer constraints, we need an upper and lower bound on the input to each ReLU. We can either select a sufficiently large bound for all nodes²² or compute specific bounds by evaluating the natural inclusion function.²³ To compute an overapproximation of the output set, we evaluate the support function in multiple directions.

In addition to evaluating the support function, we can use the MILP formulation to check other properties of the neural network by changing the objective function or adding constraints.²⁴ For example, we can check if the output set intersects with a given avoid set or find the maximum disturbance that causes the network to change its output. In general, neural network verification approaches can be combined with the techniques discussed in this chapter to verify closed-loop properties of systems that contain neural networks.²⁵

22. M. Akintunde, A. Lomuscio, L. Maganti, and E. Pirovano, "Reachability Analysis for Neural Agent-Environment Systems," in International Conference on Principles of Knowledge Representation and Reasoning, 2018.

23. V. Tjeng, K. Y. Xiao, and R. Tedrake, "Evaluating Robustness of Neural Networks with Mixed Integer Programming," in International Conference on Learning Representations (ICLR), 2018.

24. C. A. Strong, H. Wu, A. Zeljic, K. D. Julian, G. Katz, C. Barrett, and M. J. Kochenderfer, "Global Optimization of Objective Functions Represented by ReLU Networks," Machine Learning, vol. 112, pp. 3685–3712, 2023.

25. M. Everett, G. Habibi, C. Sun, and J. P. How, "Reachability Analysis of Neural Feedback Loops," IEEE Access, vol. 9, pp. 163938–163953, 2021.

9.8 Summary

• Reachable sets for nonlinear systems are often nonconvex and difficult to compute exactly.

• We can apply a variety of techniques to overapproximate the reachable sets of nonlinear systems.
• We can use interval arithmetic to create inclusion functions that provide overapproximate output intervals for nonlinear functions.
• We can sample the support function of the reachable set for nonlinear sys-
tems by solving an overapproximate linear program or mixed-integer linear
program.
• We can extend some of the techniques outlined in this chapter to analyze the
output sets of neural networks.
10 Reachability for Discrete Systems
While the techniques in chapters 8 and 9 focus on reachability for systems with
continuous states, this chapter focuses on reachability for systems with discrete
states. We begin by representing the transitions of a discrete system as a directed
graph. This formulation allows us to use graph search algorithms to perform
reachability analysis. Next, we discuss techniques for probabilistic reachability
analysis, in which we calculate the probability of reaching a particular state or
set of states. We conclude by discussing a method to apply these techniques to
continuous systems by abstracting them into discrete systems.
In this representation, each node corresponds to a state and each edge represents a transition between states (figure 10.1). We can also associate a probability with each edge to represent the likelihood of the transition occurring. Algorithm 10.1 creates a directed graph from a discrete system. For each discrete state, it computes the set of possible next states and their corresponding probabilities. It then adds an edge to the graph for each possible transition. Figure 10.2 shows the graph representation of the grid world system. For systems with large state spaces, it may be inefficient to store the full graph in memory. In these cases, we can represent the graph implicitly using a function that takes in a state and returns its possible successors, as in the sketch below.

[Figure 10.1. Graph representation of a discrete system with two states, s₁ and s₂. The graph has a node for each state and an edge originating from each state for each possible transition. Each edge is labeled with the probability of the transition. For example, when we are in s₁, we have a 0.8 probability of transitioning from s₁ to s₂.]
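A minimal sketch of this construction in the spirit of algorithm 10.1, assuming a hypothetical successors function that returns the possible next states and their probabilities:

    function build_graph(𝒮, successors)
        graph = Dict{Any, Vector{Tuple{Any, Float64}}}()
        for s in 𝒮
            s′s, ps = successors(s)
            graph[s] = collect(zip(s′s, ps))   # one weighted edge per transition
        end
        return graph
    end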
10.2 Reachable Sets

To compute reachable sets, we ignore the probabilities associated with the edges of the graph and focus only on its connectivity. The reachable sets are represented as collections of discrete states. We focus on two types of reachability analysis: forward reachability and backward reachability. Forward reachability analysis determines the set of states that can be reached from a given set of initial states within a specified time horizon.¹ Backward reachability analysis determines the set of states from which a given set of target states can be reached within a specified time horizon. Figure 10.3 demonstrates the difference between the two types of reachability analysis, and the rest of this section presents algorithms for each type.

1. This process is sometimes referred to as bounded model checking.
The reachable set has converged once it no longer changes. If we find that R_{1:d} = R_{1:d−1}, the reachable set has converged, and R_{1:∞} = R_{1:d}.² We can also check for invariant sets by relaxing this condition. Specifically, if R_{1:d} ⊆ R_{1:d−1}, we can conclude that R_{1:d} is an invariant set and that the system will remain within this set for all future time steps (R_{1:∞} ⊆ R_{1:d}). Performing this check on discrete sets is straightforward because we can directly compare the states contained in each set.

2. This condition allows us to perform unbounded model checking, in which the output holds over all possible trajectories.
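A minimal sketch of forward reachability with this convergence check, assuming a hypothetical succs function that returns the possible successors of a state:

    function forward_reachable(succs, S₁, d)
        R = Set(S₁)
        for t in 2:d
            Rnext = union(R, Set(s′ for s in R for s′ in succs(s)))
            Rnext ⊆ R && return R   # converged: no new states are reachable
            R = Rnext
        end
        return R
    end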
10.3 Satisfiability
We can use the forward and backward reachable sets of discrete systems to
determine whether they satisfy a reachability specification (figure 10.6). For
forward reachability, we check whether the target set intersects with the forward
reachable set. For backward reachability, we check whether the initial set intersects
with the backward reachable set. In both cases, these checks require us to compute
the full forward or backward reachable set. This process can be computationally
expensive, especially for systems with large state spaces.
We can further increase efficiency by using heuristics to prioritize paths that are more
likely to lead to a counterexample. In cases where the system satisfies the specifi-
cation and no counterexample exists, these algorithms have the same computa-
tional complexity as breadth-first search. Figure 10.7 compares the performance
of breadth-first search, depth-first search, and heuristic search for finding coun-
terexamples in the grid world problem.
Example 10.1. Demonstration of difficulties that arise when applying graph search algorithms to the wildfire problem.

The wildfire problem is an example of a problem in which graph search is intractable. Consider a wildfire scenario modeled as an n × n grid where each cell is either burning or not burning. At each time step, a burning cell has a nonzero probability of spreading the fire to each of its neighboring cells. A burning cell will also remain burning at the next time step with some probability. This problem has 2^(n²) possible states, and a state with b burning cells has as many as 2^(5b) possible successors. For a 5 × 5 grid, the state space has 2²⁵ ≈ 3.4 × 10⁷ states. For a 10 × 10 grid, that number increases to 2¹⁰⁰ ≈ 1.27 × 10³⁰ possible states. The example below shows the successors for a state where only the cell in the center is burning.

[grid illustrations: a single burning center cell and a sample of its successor states]

Even though only one cell is burning, there are still 32 successor states. This number only increases as we increase the number of burning cells. A state with 10 burning cells has as many as 2⁵⁰ ≈ 1.13 × 10¹⁵ successors. For most grid sizes, even partially computing and storing the graph for the wildfire problem is intractable. For this reason, we cannot use graph search algorithms for this problem and must turn to other methods such as Boolean satisfiability.
Example 10.2. Encoding the initial state of the wildfire problem as a propositional logic formula using the Satisfiability.jl package.

Consider a wildfire problem with a 10 × 10 grid and a time horizon of h = 20. The state at a particular time step is represented as a set of Boolean variables that represent whether each grid cell is burning. The SAT problem will therefore have 100 × 20 = 2000 Boolean variables representing the states at each time step. We can represent the initial state as a propositional logic formula that evaluates to true when the bottom left cell is burning and all other cells are not burning. The following code implements this formula:

    n = 10 # grid is n x n
    h = 20 # time horizon
    @satvariable(burning[1:n, 1:n, 1:h], Bool)
    init = burning[1, 1, 1] # bottom left cell is burning
    for i in 1:n, j in 1:n
        if i ≠ 1 || j ≠ 1 # all other cells are not burning
            init = init ∧ ¬burning[i, j, 1]
        end
    end
Combining the initial state and transition formulas with the failure condition, we can create a single propositional logic formula that represents the reachability problem:

    init ∧ transition ∧ ¬ψ    (10.1)

A SAT solver will search the space of possible values for the Boolean variables s_{1:h} to find an assignment that satisfies the formula. A satisfying assignment corresponds to a feasible trajectory that satisfies the failure condition. Therefore, if the SAT solver determines that there are no satisfying assignments, we can conclude that the system satisfies the specification. Example 10.4 demonstrates how to use Boolean satisfiability to check reachability specifications for the wildfire problem.
Example 10.3. Encoding the transitions of the wildfire problem as a propositional logic formula using the Satisfiability.jl package.

The following code implements the propositional logic formula for the transitions of the wildfire problem:

    transition = true
    for i in 1:n, j in 1:n, t in 1:h-1
        transition = transition ∧ (
            burning[i, j, t+1] ⟹
                (burning[i, j, t] ∨
                 burning[max(1, i-1), j, t] ∨
                 burning[min(n, i+1), j, t] ∨
                 burning[i, max(1, j-1), t] ∨
                 burning[i, min(n, j+1), t])
        )
    end

[grid illustrations: a pair of consecutive states for which the transition formula evaluates to true and a pair for which it evaluates to false]

In the first case, both cells burning at time t + 1 were either burning at time t or had a neighbor that was burning at time t. In the second case, the cell at (3, 4) was not burning at time t, and none of its neighbors were burning at time t.
Example 10.4. Checking reachability specifications for the wildfire problem using Boolean satisfiability.

Suppose there is a densely populated area in the top right cell of the wildfire grid, and we want to determine whether it might burn. We can encode the failure condition as a propositional logic formula that evaluates to true when the top right cell is burning. We can then combine this formula with the initial state and transition formulas from equation (10.1) and pass it to a SAT solver to determine whether the top right cell is reachable. The following code demonstrates this process:

    ψ = true
    for t in 1:h
        ψ = ψ ∧ ¬burning[n, n, t] # specification: the top right cell never burns
    end
    reachable = sat!(init ∧ transition ∧ ¬ψ)
10.4 Probabilistic Reachability

Example 10.5. Comparison of reachable set analysis and probabilistic forward reachability analysis on the grid world problem.

Consider the grid world problem with a slip probability of 0.3. Running algorithm 10.2 with a time horizon h = 9 leads to the conclusion that the system is unsafe because the obstacle is included in the forward reachable set. However, the probability of reaching the obstacle after 9 steps when following the optimal policy is only 0.0004, and the system is more likely to be in a state near its nominal path to the goal. In this scenario, the probabilistic reachability analysis provides a more useful assessment of the actual safety of the system. The plots below show the reachable set (left) and the results of a probabilistic reachability analysis (right).
Example 10.6. Determining occupancy probabilities for the grid world problem.

The plots below show the results from probabilistic occupancy analysis on the grid world problem with a slip probability of 0.3. They show the distribution over reachable states at different time steps, with reachable states appearing larger and darker states indicating a higher probability of occupancy. The nominal path is highlighted in gray.

While the obstacle state is reachable in three of the plots, the probability of occupying the obstacle state is low, and the probability is much higher for states near the nominal path. After 50 time steps, most of the probability mass is in the goal state, with a small portion in the obstacle state and the other grid cells. At this point, the probability of being in the goal state is 0.981 and the probability of being in the obstacle state is 0.018. We can use these numbers to draw conclusions about the overall safety of the system.
2024-10-24 19:58:01-07:00, comments to bugs@algorithmsbook.com
214 chap ter 10. reachability for discrete systems
In other words, for states in the target set, the probability of reaching the target set is 1. For all other states, the probability of reaching the target set within t + 1 time steps is the sum of the probability of transitioning to each of its successors times the probability that they reach the target set within t time steps. We initialize R₁ to be 1 for states in the target set and 0 otherwise.

We can use the results of this analysis to identify dangerous states for the system. Furthermore, if we know the initial state distribution P₁ for the system, we can determine the probability of reaching the target set within a given time horizon h by summing the probability of reaching the target set from each state weighted by the probability of occupying that state at time t = 1:

    P_reach = Σ_{s∈S} P₁(s) R_h(s)    (10.4)
In many scenarios, running the analysis for a sufficiently long time horizon is enough to draw conclusions about the overall safety of the system. However, it is also possible to compute the probability of reaching the target set in the limit as the time horizon approaches infinity. This probability is known as the infinite-horizon reachability probability.

To compute this probability, we rewrite the recursive relationship in equation (10.3) as

    R_{t+1}(s) = R₁(s) + Σ_{s′∈S} T_R(s, s′) R_t(s′)    (10.5)

where

    T_R(s, s′) = 0 if s ∈ S_T, and T_R(s, s′) = T(s, s′) otherwise    (10.6)

[Figure 10.9. Probability of reaching the goal state and the obstacle state in the grid world problem with a slip probability of 0.6 as a function of the time horizon. We assume the system is initialized in the bottom left corner. As the horizon increases, the probabilities begin to converge.]
In matrix form, equation (10.5) becomes

    R_{t+1} = R₁ + T_R R_t    (10.7)

where R_t is a vector of length |S| such that the ith entry corresponds to R_t(sᵢ), and T_R is a matrix of size |S| × |S| such that the entry in the ith row and jth column corresponds to T_R(sᵢ, sⱼ).¹⁰ For an infinite horizon, we have that

    R_∞ = R₁ + T_R R_∞    (10.8)

We can solve for R_∞ by rearranging the terms in equation (10.8) to get

    R_∞ − T_R R_∞ = R₁    (10.9)
    (I − T_R) R_∞ = R₁    (10.10)
    R_∞ = (I − T_R)⁻¹ R₁    (10.11)

10. This formulation is equivalent to a Markov reward process with an immediate reward of 1 for all states in the target set and 0 otherwise. The states in the target set are terminal states.
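A minimal sketch of this computation, assuming T is the |S| × |S| transition matrix and target is a Boolean vector marking the target set:

    using LinearAlgebra

    function infinite_horizon_reachability(T, target)
        R₁ = Float64.(target)            # R₁ is 1 on the target set, 0 elsewhere
        T_R = copy(T)
        T_R[target, :] .= 0.0            # target states are terminal (eq. 10.6)
        return (I - T_R) \ R₁            # solve (I − T_R) R∞ = R₁ (eq. 10.11)
    end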
10.5 Discrete State Abstractions

The methods discussed in this chapter apply only to discrete systems. However, we can use them to produce overapproximate reachability results for continuous systems by creating a discrete state abstraction (DSA).
Example 10.7. Infinite-horizon probability of reaching the obstacle for different slip probabilities in the grid world problem.

Suppose we want to understand the probability of reaching the obstacle state for grid world problems with different slip probabilities. The plots below show the results of infinite-horizon reachability analysis with the obstacle as the target set for slip probabilities of 0.3, 0.5, and 0.7. For each slip probability, we compute P_fail assuming we start in the bottom left corner of the grid.
To create a discrete state abstraction, we partition the continuous state space into a finite number of smaller regions. We then create a graph where the nodes correspond to the regions, and the edges correspond to transitions between regions. Figure 10.10 shows the process of creating a DSA for the inverted pendulum problem.
10.6 Summary

• Backward reachability algorithms begin with a set of target states and calculate the set of states that can reach the target set in a given time horizon.
Example 10.8. Creating a DSA for the inverted pendulum system using algorithm 10.1. The plots show the process of determining the connectivity of the graph for a single region S⁽ⁱ⁾ and the final DSA for a uniform partition of the state space into 64 regions in the θ (rad), ω (rad/s) plane.

We can create a DSA for the inverted pendulum system using algorithm 10.1 by defining the states function to partition the state space into a grid of regions and the successors function to determine the connectivity of the graph using a nonlinear forward reachability technique such as conservative linearization. Example implementations are as follows:

    function states(env::InvertedPendulum; nθ=8, nω=8)
        θs, ωs = range(-1.2, 1.2, length=nθ+1), range(-1.2, 1.2, length=nω+1)
        𝒮 = [Hyperrectangle(low=[θlo, ωlo], high=[θhi, ωhi])
             for (θlo, θhi) in zip(θs[1:end-1], θs[2:end])
             for (ωlo, ωhi) in zip(ωs[1:end-1], ωs[2:end])]
        return 𝒮
    end

    function successors(sys, 𝒮⁽ⁱ⁾)
        _, 𝒳 = sets(sys, 2)
        ℛ⁽ⁱ⁾ = conservative_linearization(sys, 𝒮⁽ⁱ⁾ × 𝒳)
        ℛ⁽ⁱ⁾ = VPolytope([clamp.(v, -1.2, 1.2) for v in vertices_list(ℛ⁽ⁱ⁾)])
        𝒮⁽ʲ⁾s = filter(𝒮⁽ʲ⁾->!isempty(ℛ⁽ⁱ⁾ ∩ 𝒮⁽ʲ⁾), states(sys.env))
        return 𝒮⁽ʲ⁾s, ones(length(𝒮⁽ʲ⁾s))
    end
The plots below demonstrate the successors function on an example region S⁽ⁱ⁾. The function first computes R⁽ⁱ⁾ using conservative linearization (left). It then determines the regions S⁽ʲ⁾ that intersect with R⁽ⁱ⁾ (middle). Finally, the function returns these regions so that they can be connected in the graph (right). The edge weights can be ignored when computing reachable sets.

[plots: the region S⁽ⁱ⁾ and its overapproximate reachable set R⁽ⁱ⁾, the regions that intersect R⁽ⁱ⁾, and the resulting graph connections in the θ (rad), ω (rad/s) plane]

Algorithm 10.1 calls the successors function for each region in the partition to determine the connectivity of the graph. The result is shown in the caption.
Example 10.9. Overapproximating the transition probabilities for the DSA of the continuum world system.

Suppose we have a continuum world problem with Gaussian disturbances on its transitions. For example, if the agent takes the up action, its next position is sampled from a Gaussian distribution with a mean 1 unit above its current position and a standard deviation of 1 in each direction. In other words, T(s, s′) = N(s′ | s + d, I), where d is the direction vector corresponding to the action taken in the state s. Our goal is to determine the overapproximated transition probabilities T(S, S′) for a DSA of the continuous system.

To obtain the probability of transitioning from a specific state s to a region in the partition S′, we integrate the transition function such that T(s, S′) = ∫_{S′} T(s, s′) ds′. To obtain an overapproximation of the transition probabilities, we select the state in the current region S that results in the highest probability of reaching the target region S′ such that T(S, S′) = max_{s∈S} T(s, S′). The plots below show the transition probabilities from a single state s to the regions in the DSA and demonstrate this process.

[plots: T(s, s′), T(s, S′), and T(S, S′) for the continuum world DSA]

The maximization in the formula for T(S, S′) finds the state in S that puts the highest amount of probability mass in S′. The plots below demonstrate this maximization for three different next regions S′. This process produces an overapproximation of the transition probabilities since we assume all states in S transition to the worst-case next state.
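A rough sketch of this maximization, assuming a hypothetical transition_prob(s, S′) that integrates the transition density over S′ and a hypothetical sample_region sampler for states in S. Note that maximizing over finitely many samples only approximates the true maximum; a sound implementation would optimize over S exactly.

    function overapprox_transition(transition_prob, sample_region, S, S′; m=1000)
        return maximum(transition_prob(sample_region(S), S′) for _ in 1:m)
    end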
11 Runtime Monitoring
12 Explainability

12.1 Explanations
[Figure 12.1. Rollout visualizations for the aircraft collision avoidance policy (altitude h in m versus time in s) and the inverted pendulum policy (angle θ in rad versus time in s)]
We can visualize the behavior of a policy by performing rollouts of the policy and plotting the state of the system at each time step. This visualization can help us understand how the policy behaves in different scenarios and identify potential failure modes. Figure 12.1 shows an example of this visualization technique for the aircraft collision avoidance and inverted pendulum policies.

If the policy is Markov and therefore depends only on the current state, we can also visualize it directly by plotting the action taken by the agent in each state. If the state space is two-dimensional as in the inverted pendulum example, we can plot the action taken by the agent as a two-dimensional heatmap (figure 12.2). For higher-dimensional state spaces, we often need to apply dimensionality reduction techniques to visualize the policy. One common technique is to fix all but two of the state variables, which become associated with the vertical and horizontal axes. We can indicate the action for every state with a color. Example 12.1 demonstrates this technique for the collision avoidance policy.

Instead of fixing the hidden state variables, we could also use various techniques to aggregate over them (figure 12.3). One method involves partitioning the state space into a set of regions and keeping track of the actions taken in each region over a series of rollouts. We can then aggregate over these actions by plotting the mean or mode of the actions taken in each region. One benefit of this technique is that it relies only on rollouts of the policy and therefore extends to non-Markovian policies. Because all states may not be reachable in practice, some areas of the policy plot may have no data associated with them.

[Figure 12.2. Visualization of the actions taken by the inverted pendulum policy as a heatmap over θ and ω. The colors represent the torque applied in each state.]
Example 12.1. Aircraft collision avoidance policy when the relative vertical rate is fixed at 0 m/s (left) and 4 m/s (right), and the previous action is fixed at no advisory. The colors represent the action taken by the agent in each state.

The figure in the caption shows policy plots for the aircraft collision avoidance policy when the relative vertical rate is fixed at 0 m/s and 4 m/s, and the previous action is fixed at no advisory. The red aircraft represents the relative location of the intruder aircraft. We can use these plots to explain the behavior of the policy in these scenarios. For example, we can see that when the relative vertical rate is fixed at zero, the policy advises our aircraft to climb when it is above the intruder and descend when it is below the intruder. This behavior is aligned with our objective of avoiding collisions.

The plot on the left also reveals some potentially unexpected behaviors. For example, when the time to collision is near zero and a collision is imminent, the policy results in no advisory. This behavior may prompt us to perform further analysis. For example, a counterfactual analysis (see section 12.5) reveals that a collision is inevitable in this scenario regardless of the action taken by the agent due to limits on the vertical rate of the aircraft.

[plots: actions (no advisory, descend, climb) over time to collision t_col (s) and relative altitude h (m) for relative vertical rates of 0 m/s and 4 m/s]

[Figure 12.3. Aggregated policy visualizations: on the left, the mean action taken by the agent in each state; on the right, the action taken most frequently by the agent in each state.]
12.3 Feature Importance

12.3.1 Sensitivity Analysis
Example 12.2. Motivation for considering the interactions between features when determining feature importance.

Consider a wildfire scenario modeled as a grid where each cell is either burning or not burning. At each time step, there is a 30% chance that a cell that was not burning at the previous time step will be burning if at least one of its neighbors was burning. The plots below show an example of a current state s_t and the probability that each cell is burning at the next time step p(s_{t+1}) (darker cells indicate higher probability). Suppose we are interested in understanding the features that are most important in determining the probability that the cell in the upper right corner will burn.

[grid plots: the current state s_t and the next-step burning probabilities p(s_{t+1})]

For this example, we will focus specifically on the feature that indicates whether the cell directly to the left of the upper right cell is burning. We can test the first definition of feature importance by changing that cell to not burning while holding all other cells constant and observing the effect on the probability that the upper right cell will burn. In this case (leftmost plots), the probability that the upper right cell will be burning at the next time step does not change. Therefore, we will conclude that this cell has no contribution to the output. However, if we remove fire from both this cell and the cell below the upper right cell (rightmost plots), the upper right cell changes to zero probability of burning at the next time step. The second definition of feature importance considers the interaction between these two features and would conclude that the cell does contribute to the output.

[grid plots: perturbing the single cell (left) versus perturbing both cells (right) and the resulting next-step burning probabilities]
Example 12.3. Sensitivity analysis at a single time step. Brighter pixels in the sensitivity map indicate pixels with higher sensitivity.

Suppose we have an agent that selects a steering angle for an aircraft based on runway images from a camera mounted on its wing. Given a particular input image, we can generate a sensitivity map to identify the pixels that are most important in determining the steering angle by fixing all but the pixel of interest and checking its effect on the steering angle output. The results are shown below, where the left image is the original image and the right image is the sensitivity map. This analysis indicates that the agent is focusing on the portion of the runway in front of it where the lines are most visible.

[images: the original runway image and its sensitivity map]
Example 12.4. Sensitivity analysis over a full trajectory. Brighter colors in the sensitivity map of the inverted pendulum trajectory indicate higher sensitivity. The black line shows the true angle of the pendulum at each time step, and the colored markers indicate the noisy observation of the current angle at each time step.

We can use sensitivity analysis to understand the effect of disturbances on the outcome of a trajectory. For example, consider an inverted pendulum system in which the agent's observation of its current angle is subject to a noise disturbance. We can estimate the sensitivity of the robustness of a trajectory with respect to its disturbances by perturbing the disturbances at each time step and observing the effect on the robustness of the trajectory. The results on a given failure trajectory are shown below. This analysis indicates that small changes in the disturbances at the beginning of the trajectory have a large effect on the robustness of the trajectory. Furthermore, the disturbances applied towards the end of the failure trajectory have little to no effect because the controller is saturated and the system cannot recover.

[plot: pendulum angle θ (rad) over time (s), with disturbance sensitivity indicated by marker color]
234 chap ter 1 2. explainability
A simple way to produce a saliency map given a set of inputs is to take the 6
D. Baehrens, T. Schroeter, S.
Harmeling, M. Kawanabe, K.
gradient of the output of interest with respect to the inputs.6 The saliency of a
Hansen, and K.-R. Müller, “How
particular input is related to the magnitude of the gradient at that input. A high to Explain Individual Classi-
gradient magnitude indicates that small changes in the input will result in large fication Decisions,” Journal of
Machine Learning Research, vol. 11,
changes in the output. In other words, inputs with high gradient values are more pp. 1803–1831, 2010.
salient and indicate higher sensitivity. This method is often used to determine the
7
K. Simonyan, A. Vedaldi, and A.
components of an observation (such as the pixels of an image) that contribute
Zisserman, “Deep Inside Convolu-
most to an agent’s decision.7 We can also use it to approximate sensitivity over tional Networks: Visualising Image
a full trajectory by taking the gradient of a performance measure with respect Classification Models and Saliency
Maps,” in International Conference
to input features such as actions or disturbances. Algorithm 12.2 measures the on Learning Representations (ICLR),
sensitivity of the robustness of a trajectory with respect to its disturbances, and 2014.
figure 12.5 shows an example on the inverted pendulum system. 8
For image inputs in particular, it
has also been shown that there are
While algorithm 12.2 is more computationally efficient than algorithm 12.1, it
sometimes meaningless local vari-
is limited by its local nature. Important input features, for example, often saturate ations in gradients that can lead to
the output function of interest, causing the gradient to be small even when the noisy sensitivity maps. D. Smilkov,
N. Thorat, B. Kim, F. Viégas, and
feature is important.8 The integrated gradients9 algorithm addresses this limitation M. Wattenberg, “Smoothgrad: Re-
by averaging the gradient along the path between a baseline input and the input moving Noise by Adding Noise,”
in International Conference on Ma-
of interest (figure 12.6). The choice of baseline depends on the context. For images,
chine Learning (ICML), 2017.
a common choice is a black image (figure 12.7). For disturbances, we can set all 9
M. Sundararajan, A. Taly, and
disturbances to zero. Q. Yan, “Axiomatic Attribution
for Deep Networks,” in Interna-
Algorithm 12.3 calculates the sensitivity of the robustness of a trajetory with
tional Conference on Machine Learn-
respect to the disturbances at each time step using integrated gradients. It takes ing (ICML), 2017.
m steps along the path between the baseline and the current input and computes
the gradient of the robustness at each step. The algorithm then returns the av-
[Figure 12.8. Comparison of the sensitivity descriptions using algorithms 12.1 to 12.3 for an aircraft taxi system that selects a steering angle from an image observation. Panels: original image, sensitivity, gradient, integrated gradients. The sensitivity map focuses on the portion where the edge and center lines are most apparent, while the gradient-based methods focus only on the edges of the runway. The integrated gradients method provides a smoother map than the single gradient approach.]
The algorithm then returns the average gradient at each time step. As m approaches infinity, the average gradient approaches the integral of the gradient along the path. Figure 12.8 compares the sensitivity estimates from algorithms 12.1 to 12.3 for an aircraft taxi system. All three methods produce slightly different descriptions of the agent's behavior, and in general, the most appropriate sensitivity estimate is application dependent.
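A minimal sketch of this path-averaging computation is shown below, in the spirit of the integrated gradients method described above. It assumes the same illustrative `robustness_of` function as before and returns the average gradient at each time step; the canonical integrated gradients attribution additionally scales this average by the difference from the baseline.

```julia
using Zygote  # automatic differentiation

# Average gradient along the straight-line path from a baseline
# disturbance trajectory x₀ (e.g., all zeros) to the disturbance
# trajectory x of interest, using m gradient evaluations.
# robustness_of is an assumed differentiable robustness function.
function integrated_gradients(robustness_of, x, x₀=zero(x); m=50)
    avg_grad = zero(x)
    for k in 1:m
        xₖ = x₀ + (k / m) * (x - x₀)  # point along the path
        avg_grad += Zygote.gradient(robustness_of, xₖ)[1]
    end
    return avg_grad / m  # canonical attribution also scales by (x - x₀)
end
```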
Another approach to feature importance uses Shapley values, a concept from cooperative game theory that treats the features as players and distributes credit for the outcome among them. The Shapley value $\phi_i$ of feature i is then defined as the average marginal contribution of feature i to the expectation of the outcome over all possible subsets of features:

$$\phi_i(x) = \sum_{\mathcal{I}_s \subseteq \mathcal{I} \setminus \{i\}} \frac{|\mathcal{I}_s|!\,(n - |\mathcal{I}_s| - 1)!}{n!} \left( f_{\mathcal{I}_s \cup \{i\}}(x) - f_{\mathcal{I}_s}(x) \right) \tag{12.2}$$
Intuitively, computing the Shapley value of feature i involves looping over all
possible subsets of features that do not include i and computing the difference in
the expectation of the outcome when adding i to the subset. The constant factor
in equation (12.2) ensures that subsets of different sizes are weighted equally. In
general, Shapley values are expensive (often intractable) to compute due to the
large number of possible subsets. For example, a function with 100 input features
has $6.3 \times 10^{29}$ possible subsets.
To make computation tractable, we can rewrite equation (12.2) as an average over orderings of the features:

$$\phi_i(x) = \frac{1}{n!} \sum_{P \in \pi(n)} \left( f_{P_{1:j}}(x) - f_{P_{1:j-1}}(x) \right) \tag{12.3}$$

where $\pi(n)$ represents the set of all possible permutations of n elements, j is the index in the permutation P that corresponds to feature i, and $P_{1:j}$ represents the first j elements of P. We can then approximate the Shapley value using sampling. For each sample, we randomly permute the features and compute the difference in the expectation of the outcome when adding feature i to the features before it in the permutation.
Algorithm 12.4 estimates the Shapley values for the disturbances in a trajectory to determine their contribution to the robustness of the trajectory. It takes in a current trajectory τ with disturbance trajectory x and a number of samples per time step m. For each time step in the trajectory, the algorithm randomly samples another disturbance trajectory w by performing a rollout using the nominal trajectory distribution.12 It then samples a random permutation P of the time steps and performs a rollout in which the disturbances are taken from x for the time steps in $P_{1:j}$ and from w for all other time steps. It similarly performs a rollout in which the disturbances are taken from x for the time steps in $P_{1:j-1}$ and from w for all other time steps. The algorithm then computes the difference in the robustness of the two rollouts and averages the differences over m sampled permutations to estimate the Shapley value of each disturbance.

12 This step requires that the disturbances sampled at each time step are independent of one another. This assumption may break if the disturbances depend on the states, actions, or observations.
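The following is a minimal sketch of this permutation-sampling scheme. It assumes an illustrative `robustness_of(x)` function that rolls out the system under disturbance trajectory `x` and returns its robustness, and an illustrative `sample_nominal()` function that samples a fresh disturbance trajectory from the nominal distribution.

```julia
using Random

# Monte Carlo estimate of the Shapley value of each disturbance in
# a trajectory, following the permutation-sampling scheme above.
function shapley_estimates(robustness_of, sample_nominal, x; m=100)
    n = length(x)
    ϕ = zeros(n)
    for i in 1:n, _ in 1:m
        w = sample_nominal()          # fresh nominal disturbances
        P = randperm(n)               # random permutation of time steps
        j = findfirst(==(i), P)       # position of time step i in P
        with_i    = [t in P[1:j]   ? x[t] : w[t] for t in 1:n]
        without_i = [t in P[1:j-1] ? x[t] : w[t] for t in 1:n]
        ϕ[i] += robustness_of(with_i) - robustness_of(without_i)
    end
    return ϕ ./ m                     # average marginal contribution
end
```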
Figure 12.9 shows the Shapley values for the disturbances of the inverted pendulum trajectory used in example 12.3 and figure 12.5. The Shapley values differ from the sensitivity estimates because they account for interactions between disturbances. Removing groups of disturbances with high Shapley values produces a large change in the outcome.
12.4 Policy Explanation through Surrogate Models

For agents with complex policies, it may be difficult to understand the reasoning behind their decisions. In such cases, we can build surrogate models to approximate the policy with a model that is easier to interpret. A good surrogate model should have the following characteristics:
[Figure: θ (rad) versus time (s) over 0 to 2 s for the inverted pendulum; legend entries: "one at a time" and "all four".]
• High Fidelity: The surrogate model should accurately represent the policy. If
the surrogate model does not adequately represent the policy, the explanations
it provides may be misleading.
One common choice for a surrogate model is a linear model. Linear models have
the form
$$f(x) = \sum_{i=1}^{n} w_i x_i + b \tag{12.4}$$

where $x_i$ is a feature of the observation, $w_i$ is a weight for feature i, and b is the bias
term. If the action space is discrete, we may apply the logistic or softmax function
to the output of the linear model to obtain probabilities for each action. Linear
surrogate models can be used to determine feature importance. The magnitudes
of the weights of the linear model indicate the contribution of each feature to the
agent’s decision. Figure 12.10 demonstrates how to use a linear surrogate model
to describe the behavior of a collision avoidance policy in two different regions of
the observation space. This technique is particularly useful for high-dimensional
observations, where it may be difficult to visualize the policy directly.
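A minimal sketch of fitting such a local linear surrogate is shown below. It samples observations around a point of interest, queries the policy, and solves a least-squares problem; the `policy` function, the sampling radius, and the sample count are all illustrative assumptions. For discrete actions, the outputs could instead be fit with a logistic model as noted above.

```julia
using Random

# Fit a local linear surrogate to a policy around a point x₀ by
# sampling nearby observations and solving least squares.
# policy(x) is an assumed function returning a scalar output.
function linear_surrogate(policy, x₀; r=1.0, m=1000)
    n = length(x₀)
    X = ones(m, n + 1)            # last column is the bias term
    y = zeros(m)
    for k in 1:m
        x = x₀ + r * randn(n)     # sample near x₀
        X[k, 1:n] = x
        y[k] = policy(x)
    end
    w = X \ y                     # least-squares weights
    return w[1:n], w[end]         # feature weights and bias
end
```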
Figure 12.10 (panels: Original Policy, Linear Approximation, Feature Weights; axes: h (m) versus t_col (s)). Linear surrogate model fit to samples in two different local regions (highlighted circles) of the observation space for a collision avoidance policy. The left column shows the original policy, where blue corresponds to the climb action, green corresponds to the descend action, and white corresponds to no advisory. The middle column shows the linear surrogate model fit to the samples in the highlighted circle indicated by the purple dots. The right column shows the feature weights of the linear surrogate model for each state variable. The linear surrogate model is accurate in the local region where it was fit, but may not be accurate in other regions of the observation space. Based on the feature weights, the linear surrogate model indicates that the agent's decision depends on both the relative altitude and the time to collision when the relative altitude is around 50 m. In contrast, when the relative altitude is around 0 m, the agent's decision primarily depends on the relative altitude.
[Figure: surrogate models built from increasingly rich feature sets (h, t_col, h t_col; t_col², h²; t_col³, h t_col², h² t_col, h³) plotted over h (m) and t_col (s), arranged along a spectrum from high fidelity to low interpretability.]
2. Distance to original input: The counterfactual input should be close to the original input τ to ensure that the change is minimal, resulting in the following objective:

$$f_{\text{close}}(\tau') = \|\tau' - \tau\|_p \tag{12.6}$$
Example 12.5. Simple decision tree for the collision avoidance policy.

Suppose we want to train a decision tree to approximate the slice of the collision avoidance policy shown in example 12.1. The following decision tree was trained on a dataset of 100,000 randomly sampled states from the policy slice. The decision tree has a maximum depth of 2 and uses the state variables to make decisions. Nodes that split using h are shown in black, nodes that split using t_col are shown in gray, and the colors of the square leaf nodes indicate the actions taken by the agent (climb, descend, or no advisory).

[Decision tree: the root splits on h < 0. When h ≥ 0, a second split with threshold 98 selects between climb and no advisory; when h < 0, a second split with threshold −101 selects between descend and no advisory. The resulting policy is plotted over h (m) and t_col (s) in the margin.]
With a maximum depth of 2, the decision tree only makes decisions based
on h. If h is positive, the tree selects whether to climb or issue no advisory
based on the magnitude of the relative altitude. Similarly, if h is negative, the
tree selects whether to descend or issue no advisory based on the magnitude
of the relative altitude. The policy represented by the decision tree is shown
in the caption. This decision tree provides a simple, interpretable model of
the agent’s policy. However, the fidelity of the decision tree is limited by its
depth, and it misses some key features of the policy that depend on the time
to collision.
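A minimal sketch of fitting such a tree is shown below. It assumes an illustrative `policy(h, t_col)` function that returns an action symbol, and uses the DecisionTree.jl package (any CART implementation would work); the sampling ranges mirror the plotted policy slice.

```julia
using DecisionTree  # CART-style decision trees

# Fit a depth-2 decision tree surrogate to a policy slice by
# sampling states and querying the (assumed) policy for labels.
m = 100_000
states = [(800 * rand() - 400, 40 * rand()) for _ in 1:m]  # (h, t_col)
X = [s[k] for s in states, k in 1:2]          # m × 2 feature matrix
y = [String(policy(s...)) for s in states]    # action labels

tree = build_tree(y, X, 0, 2)  # n_subfeatures = 0, max depth of 2
print_tree(tree)               # inspect the learned splits
```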
[Figure: collision avoidance policies plotted over h (m) and t_col (s), arranged along a spectrum from high fidelity to low interpretability.]
3. Sparsity of the change: The difference between the original input and the counterfactual input should be sparse. In other words, the counterfactual input should differ in only a few features. We can use the following objective:

$$f_{\text{sparsity}}(\tau') = \|\tau' - \tau\|_0 \tag{12.7}$$
4. Plausibility: The new input should be a plausible input. We can check plausibility using the likelihood of the counterfactual trajectory under the nominal trajectory distribution, penalizing counterfactuals with low likelihood. These objectives can be combined into a single weighted loss, as sketched below.
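The sketch below combines the objectives into one weighted loss. It assumes an illustrative `f_outcome(τ′)` term that rewards achieving the desired change in the agent's decision, an illustrative trajectory likelihood `p(τ′)` for plausibility, and user-chosen weights λ; none of these names are a fixed API.

```julia
using LinearAlgebra

# Combined counterfactual objective over a candidate input τ′,
# relative to the original input τ. All names are illustrative.
function counterfactual_loss(τ′, τ, f_outcome, p; λ=(1.0, 1.0, 1.0, 1.0))
    f_close    = norm(τ′ - τ)            # distance to the original input
    f_sparsity = count(!iszero, τ′ - τ)  # number of changed features
    f_plaus    = -log(p(τ′))             # penalize implausible trajectories
    return λ[1]*f_outcome(τ′) + λ[2]*f_close +
           λ[3]*f_sparsity + λ[4]*f_plaus
end
```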
[Figure: counterfactual trajectory for the inverted pendulum over 0 to 2 s; legend entry: "8 changes".]
12.6 Failure Mode Characterization

Another way to explain the behavior of a system is to characterize its failure modes. We can use clustering algorithms to create groupings of failure trajectories that are similar to one another. Identifying the similarities and differences between failures helps us understand their underlying causes. One common clustering algorithm is k-means21 (algorithm 12.6), which groups data points into k clusters based on their similarity to one another.22

21 This algorithm is also referred to as Lloyd's algorithm, named after Stuart P. Lloyd (1923–2007). S. Lloyd, "Least Squares Quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.

22 A detailed overview of clustering algorithms is provided in D. Xu and Y. Tian, "A Comprehensive Survey of Clustering Algorithms," Annals of Data Science, vol. 2, pp. 165–193, 2015.

To apply k-means, we must first extract a set of real-valued features from each failure trajectory to use for clustering. Let x represent the set of features from trajectory τ and φ be a feature extraction function such that x = φ(τ). To represent the clusters C, k-means keeps track of k cluster centroids $\mu_{1:k}$ in feature space and assigns each trajectory to the cluster with the closest centroid to its features. We begin by initializing the centroids to the features of k random trajectories. At each iteration, k-means performs the following steps:
1. Assign each trajectory to the cluster with the closest centroid to its feature vector. In other words, $\tau_i$ is assigned to cluster $C_j$ when

$$j = \arg\min_{j'} \|\phi(\tau_i) - \mu_{j'}\| \tag{12.9}$$
2. Update the centroids to the mean of the feature vectors of the trajectories in each cluster such that

$$\mu_j = \frac{1}{|C_j|} \sum_{\tau \in C_j} \phi(\tau) \tag{12.10}$$

A minimal implementation of these two steps is sketched after this list.
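The sketch below implements the assignment and update steps directly over trajectory features. The feature extraction function `ϕ` and the failure trajectories `τs` are assumed to be supplied by the user.

```julia
using LinearAlgebra, Random

# A minimal k-means (Lloyd's algorithm) over trajectory features.
function kmeans(ϕ, τs, k; iterations=100)
    xs = [ϕ(τ) for τ in τs]
    μ = xs[randperm(length(xs))[1:k]]   # random initial centroids
    assignments = zeros(Int, length(xs))
    for _ in 1:iterations
        # step 1: assign each trajectory to the closest centroid
        for (i, x) in enumerate(xs)
            assignments[i] = argmin([norm(x - μj) for μj in μ])
        end
        # step 2: move each centroid to the mean of its members
        for j in 1:k
            members = xs[assignments .== j]
            isempty(members) || (μ[j] = sum(members) / length(members))
        end
    end
    return μ, assignments
end
```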
[Figure 12.15: clusters of failure trajectories for the inverted pendulum, θ (rad) versus time (s), one panel per cluster.]
The clustering results help us understand the failure modes of the system. One way to interpret the clusters is to create a prototypical example for each cluster. The prototypical example for a given cluster is the trajectory that is closest to its centroid in feature space. By examining the prototypical examples, we can understand the characteristics of each failure mode. Figure 12.16 shows the prototypical examples for the final clusters in figure 12.15. At runtime, we can assign new failure trajectories to the cluster with the closest centroid to their features and use the prototypical examples to explain the failure mode of the trajectory.

Figure 12.16. Prototypical examples of failure modes for the inverted pendulum system using the clusters in figure 12.15. The prototypes reveal that one failure mode involves the pendulum falling to the left, while the other involves the pendulum falling to the right.

Algorithm 12.6 requires us to select the number of clusters k, the distance function d, and the feature extraction function φ. The clustering results are highly dependent on these choices, and selecting the number of clusters and the features is often a subjective process that requires domain knowledge. To select the number of clusters, we can try different values for k and select the one that results in the most interpretable clusters or that minimizes a clustering objective such as the sum of the squared distances between each point and its cluster centroid.
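The sketch below sweeps over candidate values of k and records this clustering objective, using the `kmeans` sketch above; the "elbow" of the resulting curve is a common heuristic for choosing k. It again assumes `ϕ` and `τs` are in scope.

```julia
# Sum of squared distances between each feature vector and its
# assigned centroid, for a given clustering.
function cluster_objective(ϕ, τs, μ, assignments)
    xs = [ϕ(τ) for τ in τs]
    return sum(norm(xs[i] - μ[assignments[i]])^2 for i in eachindex(xs))
end

# Evaluate the objective for k = 1, …, 10 and look for an elbow.
objectives = map(1:10) do k
    μ, assignments = kmeans(ϕ, τs, k)
    cluster_objective(ϕ, τs, μ, assignments)
end
```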
We can also use domain knowledge to select features that are likely to capture
the underlying causes of the failures. A simple way to select features is to create a
feature vector by concatenating all of the states in the trajectory. We could create
similar feature vectors for the actions, observations, and disturbances. However,
these feature vectors will be high-dimensional and may not result in interpretable
clusters (figure 12.17).
Figure 12.17 (panels: State Features, Action Features, Disturbance Features; θ (rad) versus time (s)). Clustering failure trajectories of the inverted pendulum system using features consisting of the states, actions, and disturbances of each trajectory, respectively. The colors represent the different clusters. The clusters based on the state and action features show interpretable failure modes, while the clusters based on the disturbance features are less interpretable.
We can also extract features using parametric signal temporal logic (PSTL) templates, which capture temporal characteristics of the failure modes. Figure 12.18 shows the clusters of failure trajectories of the inverted pendulum system using the PSTL template in example 12.6.

[Figure 12.18: clusters of failure trajectories for the inverted pendulum using PSTL features, θ (rad) versus time (s).]

Clustering using PSTL features requires us to select a template formula. The template formula should capture the key aspects of the system that are relevant to the failure modes. For systems with complex failure modes, it may be difficult to hand-design a template formula that captures all the failure modes. In these cases, we can use more sophisticated techniques that build decision trees using a grammar based on temporal logic.24

24 R. Lee, M. J. Kochenderfer, O. J. Mengshoel, and J. Silbermann, "Interpretable Categorization of Heterogeneous Time Series Data," in SIAM International Conference on Data Mining, 2018.
Example 12.6. A PSTL template formula for the inverted pendulum system. The goal is to find the value of φ that causes a given trajectory to marginally satisfy the formula.

The following STL formula specifies that the angle of the pendulum should not exceed π/4 for the first 200 time steps:

$$\psi = \square_{[0,200]} \left( \theta < \frac{\pi}{4} \right)$$

If we replace the time bound with a parameter φ, we obtain the following PSTL template formula:

$$\psi_\phi = \square_{[0,\phi]} \left( \theta < \frac{\pi}{4} \right)$$

The plots below show the robustness of the formula for different values of φ. The plot on the left (φ = 2.45) shows a value for φ such that the trajectory satisfies the formula, the plot in the middle (φ = 3.85) shows a value for φ that marginally satisfies the formula, and the plot on the right (φ = 4.05) shows a value for φ such that the trajectory does not satisfy the formula.

[Plots: θ (rad) versus time (s) for φ = 2.45, φ = 3.85, and φ = 4.05.]

We can find the value of φ that marginally satisfies $\psi_\phi$ by searching for the value that causes the robustness to be as close as possible to zero. For this simple formula, we can solve the optimization problem using a grid search over the values of φ. The value of φ that marginally satisfies the formula will be the time just before the magnitude of the angle of the pendulum exceeds π/4.
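A minimal sketch of this grid search is shown below. It assumes an illustrative `robustness(τ, ϕ)` function that evaluates the robustness of $\psi_\phi$ on trajectory τ for a given parameter value.

```julia
# Grid search for the PSTL parameter ϕ that marginally satisfies
# ψϕ on a given trajectory, i.e., the ϕ whose robustness is
# closest to zero.
function marginal_parameter(robustness, τ, ϕs)
    return argmin(ϕ -> abs(robustness(τ, ϕ)), ϕs)
end

# e.g., marginal_parameter(robustness, τ, 0:0.05:5)
```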
A Problems
References
27. J. Colin, T. Fel, R. Cadène, and T. Serre, “What I Cannot Predict, I Do Not Understand:
A Human-Centered Evaluation Framework for Explainability Methods,” Advances
in Neural Information Processing Systems (NeurIPS), pp. 2832–2845, 2022 (cit. on
p. 226).
28. V. Conitzer, “Eliciting Single-Peaked Preferences Using Comparison Queries,”
Journal of Artificial Intelligence Research, vol. 35, pp. 161–191, 2009 (cit. on p. 28).
29. A. Couëtoux, J.-B. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard, “Continuous
Upper Confidence Trees,” in Learning and Intelligent Optimization (LION), 2011 (cit.
on p. 84).
30. S. Dandl, C. Molnar, M. Binder, and B. Bischl, “Multi-Objective Counterfactual
Explanations,” in International Conference on Parallel Problem Solving from Nature,
2020 (cit. on pp. 243, 246, 248).
31. T. Dang and T. Nahhal, “Coverage-Guided Test Generation for Continuous and
Hybrid Systems,” Formal Methods in System Design, vol. 34, pp. 183–213, 2009 (cit.
on p. 76).
32. P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A Tutorial on the
Cross-Entropy Method,” Annals of Operations Research, vol. 134, pp. 19–67, 2005 (cit.
on p. 119).
33. P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo Samplers,” Journal of
the Royal Statistical Society Series B: Statistical Methodology, vol. 68, no. 3, pp. 411–436,
2006 (cit. on p. 127).
34. H. Delecki, A. Corso, and M. J. Kochenderfer, “Model-Based Validation as Proba-
bilistic Inference,” in Conference on Learning for Dynamics and Control (L4DC), 2023
(cit. on p. 97).
35. W. M. Dickie, “A Comparison of the Scientific Method and Achievement of Aristotle
and Bacon,” The Philosophical Review, vol. 31, no. 5, pp. 471–494, 1922 (cit. on p. 3).
36. M. Dowson, “The Ariane 5 Software Failure,” Software Engineering Notes, vol. 22,
no. 2, p. 84, 1997 (cit. on p. 7).
37. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid Monte Carlo,”
Physics Letters B, vol. 195, no. 2, pp. 216–222, 1987 (cit. on p. 102).
38. A. Duret-Lutz, “Manipulating LTL Formulas Using Spot 1.0,” in Automated Technol-
ogy for Verification and Analysis, 2013 (cit. on p. 43).
39. EASA AI Task Force, “Concepts of Design Assurance for Neural Networks,” EASA,
2020 (cit. on p. 5).
40. V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo, “Generalized Multiple Im-
portance Sampling,” Statistical Science, vol. 34, no. 1, pp. 129–155, 2019 (cit. on
p. 116).
41. A. Engel, Verification, Validation, and Testing of Engineered Systems. John Wiley & Sons,
2010, vol. 73 (cit. on p. 1).
42. J. M. Esposito, J. Kim, and V. Kumar, “Adaptive RRTs for Validating Hybrid Robotic
Control Systems,” in Algorithmic Foundations of Robotics, Springer, 2005, pp. 107–121
(cit. on p. 74).
43. M. Everett, G. Habibi, C. Sun, and J. P. How, "Reachability Analysis of Neural Feedback Loops," IEEE Access, vol. 9, pp. 163,938–163,953, 2021 (cit. on pp. 194, 197).
45. M. Forets and C. Schilling, “LazySets.jl: Scalable Symbolic-Numeric Set Compu-
tations,” Proceedings of the JuliaCon Conferences, vol. 1, no. 1, pp. 1–11, 2021 (cit. on
p. 147).
46. K. Forsberg and H. Mooz, “The Relationship of System Engineering to the Project
Cycle,” Center for Systems Management, vol. 5333, 1991 (cit. on p. 5).
47. R. Geirhos, J.-H. Jacobsen, C. Michaelis, et al., “Shortcut Learning in Deep Neural
Networks,” Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020 (cit. on
p. 229).
48. J. W. Gelder, “Air Law: The Federal Aviation Act of 1958,” Michigan Law Review,
vol. 57, no. 8, pp. 1214–1227, 1959 (cit. on p. 5).
49. B. Goodman and S. Flaxman, “European Union Regulations on Algorithmic Decision-
Making and a ‘Right to Explanation’,” AI Magazine, vol. 38, no. 3, pp. 50–57, 2017
(cit. on p. 226).
50. M. A. Gosavi, B. B. Rhoades, and J. M. Conrad, “Application of Functional Safety in
Autonomous Vehicles Using ISO 26262 Standard: A Survey,” in SoutheastCon, 2018
(cit. on p. 5).
51. U. Grenander and M. I. Miller, “Representations of Knowledge in Complex Sys-
tems,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 56, no. 4,
pp. 549–581, 1994 (cit. on p. 99).
52. A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of
Algorithmic Differentiation, 2nd ed. SIAM, 2008 (cit. on p. 103).
53. H. Hansson and B. Jonsson, “A Logic for Reasoning about Time and Reliability,”
Formal Aspects of Computing, vol. 6, pp. 512–535, 1994 (cit. on p. 23).
54. P. E. Hart, N. J. Nilsson, and B. Raphael, “A Formal Basis for the Heuristic Determi-
nation of Minimum Cost Paths,” IEEE Transactions on Systems Science and Cybernetics,
vol. 4, no. 2, pp. 100–107, 1968 (cit. on p. 80).
55. W. K. Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, vol. 57, no. 1, pp. 97–109, 1970 (cit. on p. 94).
56. C. Hensel, S. Junges, J.-P. Katoen, T. Quatmann, and M. Volk, “The Probabilistic
Model Checker Storm,” International Journal on Software Tools for Technology Transfer,
pp. 1–22, 2022 (cit. on p. 210).
57. J. Herkert, J. Borenstein, and K. Miller, “The Boeing 737 MAX: Lessons for Engi-
neering Ethics,” Science and Engineering Ethics, vol. 26, pp. 2957–2974, 2020 (cit. on
p. 7).
58. M. D. Hoffman and A. Gelman, "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo," Journal of Machine Learning Research (JMLR), vol. 15, no. 1, pp. 1593–1623, 2014 (cit. on p. 102).
59. A. L. Hunkenschroer and A. Kriebitz, “Is AI Recruiting (Un)ethical? A Human
Rights Perspective on the Use of AI for Hiring,” AI and Ethics, vol. 3, no. 1, pp. 199–
213, 2023 (cit. on p. 6).
60. M. Huth and M. Ryan, Logic in Computer Science: Modelling and Reasoning about
Systems. Cambridge University Press, 2004 (cit. on pp. 30, 31).
61. K. Ishikawa and J. H. Loftus, Introduction to Quality Control. Springer, 1990, vol. 98, ch. 1 (cit. on p. 3).
62. V. S. Iyengar, J. Lee, and M. Campbell, “Q-EVAL: Evaluating Multiple Attribute
Items Using Queries,” in ACM Conference on Electronic Commerce, 2001 (cit. on p. 29).
63. L. Jaulin, M. Kieffer, O. Didrit, and É. Walter, Interval Analysis. Springer, 2001 (cit.
on p. 171).
64. P. Jorion, “Risk Management Lessons from Long-Term Capital Management,” Euro-
pean Financial Management, vol. 6, no. 3, pp. 277–300, 2000 (cit. on p. 7).
65. H. Kahn and T. E. Harris, “Estimation of Particle Transmission by Random Sam-
pling,” National Bureau of Standards Applied Mathematics Series, vol. 12, pp. 27–30,
1951 (cit. on p. 138).
66. G. K. Kamenev, "An Algorithm for Approximating Polyhedra," Computational Mathematics and Mathematical Physics, vol. 36, no. 4, pp. 533–544, 1996 (cit. on p. 162).
67. S. Karaman and E. Frazzoli, “Incremental Sampling-Based Algorithms for Optimal
Motion Planning,” Robotics Science and Systems VI, vol. 104, no. 2, pp. 267–274, 2010
(cit. on p. 79).
68. H. Karloff, Linear Programming. Springer, 2008 (cit. on p. 164).
69. S. M. Katz, K. D. Julian, C. A. Strong, and M. J. Kochenderfer, “Generating Probabilis-
tic Safety Guarantees for Neural Network Controllers,” Machine Learning, vol. 112,
pp. 2903–2931, 2023 (cit. on p. 218).
70. J. Kleinberg, S. Mullainathan, and M. Raghavan, “Inherent Trade-Offs in the Fair
Determination of Risk Scores,” in Innovations in Theoretical Computer Science (ITCS)
Conference, 2017 (cit. on p. 7).
71. M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019
(cit. on pp. 60, 246).
72. M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making.
MIT Press, 2022 (cit. on pp. 2, 10).
73. A. Kossiakoff, S. M. Biemer, S. J. Seymour, and D. A. Flanigan, Systems Engineering
Principles and Practice. John Wiley & Sons, 2020 (cit. on p. 2).
74. J. Kuchar and A. C. Drumm, “The Traffic Alert and Collision Avoidance System,”
Lincoln Laboratory Journal, vol. 16, no. 2, p. 277, 2007 (cit. on p. 6).
75. M. Kwiatkowska, G. Norman, and D. Parker, “PRISM 4.0: Verification of Proba-
bilistic Real-Time Systems,” in International Conference on Computer Aided Verification,
2011 (cit. on p. 210).
76. S. LaValle, Planning Algorithms. Cambridge University Press, 2006 (cit. on p. 69).
77. R. Lee, M. J. Kochenderfer, O. J. Mengshoel, and J. Silbermann, “Interpretable Cate-
gorization of Heterogeneous Time Series Data,” in SIAM International Conference on
Data Mining, 2018 (cit. on p. 253).
78. R. Lee, O. J. Mengshoel, A. Saksena, et al., “Adaptive Stress Testing: Finding Likely
Failure Events with Reinforcement Learning,” Journal of Artificial Intelligence Research,
vol. 69, pp. 1165–1201, 2020 (cit. on p. 82).
79. K. Leung, N. Aréchiga, and M. Pavone, “Backpropagation Through Signal Temporal
Logic Specifications: Infusing Logical Structure into Gradient-Based Methods,” The
International Journal of Robotics Research, vol. 42, no. 6, pp. 356–370, 2023 (cit. on
p. 41).
80. N. G. Leveson and C. S. Turner, “An Investigation of the Therac-25 Accidents,”
Computer, vol. 26, no. 7, pp. 18–41, 1993 (cit. on p. 6).
81. C. Liu, T. Arnon, C. Lazarus, C. Strong, C. Barrett, and M. J. Kochenderfer, “Algo-
rithms for Verifying Deep Neural Networks,” Foundations and Trends in Optimization,
vol. 4, no. 3–4, pp. 244–404, 2021 (cit. on p. 194).
82. F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, “Marginal Likelihood
Computation for Model Selection and Hypothesis Testing: an Extensive Review,”
SIAM Review, vol. 65, no. 1, pp. 3–58, 2023 (cit. on pp. 126, 129).
83. S. Lloyd, “Least Squares Quantization in PCM,” IEEE Transactions on Information
Theory, vol. 28, no. 2, pp. 129–137, 1982 (cit. on p. 249).
84. K. Makino and M. Berz, “Taylor Models and Other Validated Functional Inclusion
Methods,” International Journal of Pure and Applied Mathematics, vol. 4, no. 4, pp. 379–
456, 2003 (cit. on p. 180).
85. O. Maler, “Computing Reachable Sets: An Introduction,” French National Center of
Scientific Research, pp. 1–8, 2008 (cit. on p. 154).
102. L. Rierson, Developing Safety-Critical Software: a Practical Guide for Aviation Software
and DO-178C Compliance. CRC Press, 2017 (cit. on p. 5).
103. C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 1999, vol. 2 (cit.
on pp. 90, 94).
104. R. T. Rockafellar and S. Uryasev, “Optimization of Conditional Value-at-Risk,”
Journal of Risk, vol. 2, pp. 21–42, 2000 (cit. on p. 24).
105. W. W. Royce, “Managing the Development of Large Software Systems: Concepts
and Techniques,” IEEE WESCON, 1970 (cit. on p. 5).
106. S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed. Pearson,
2021 (cit. on p. 205).
107. R. Seidel, “Convex Hull Computations,” in Handbook of Discrete and Computational
Geometry, Chapman and Hall, 2017, pp. 687–703 (cit. on p. 156).
108. L. S. Shapley, “Notes on the N-Person Game—II: The Value of an N-Person Game,”
1951 (cit. on p. 236).
109. C. Sidrane, A. Maleki, A. Irfan, and M. J. Kochenderfer, “OVERT: An Algorithm
for Safety Verification of Neural Network Control Policies for Nonlinear Systems,”
Journal of Machine Learning Research, vol. 23, no. 117, pp. 1–45, 2022 (cit. on pp. 189,
190).
110. J. Siegel and G. Pappas, “Morals, Ethics, and the Technology Capabilities and
Limitations of Automated and Self-Driving Vehicles,” AI & Society, vol. 38, no. 1,
pp. 213–226, 2023 (cit. on p. 7).
111. S. Sigl and M. Althoff, “M-Representation of Polytopes,” ArXiv:2303.05173, 2023
(cit. on p. 156).
112. K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks:
Visualising Image Classification Models and Saliency Maps,” in International Con-
ference on Learning Representations (ICLR), 2014 (cit. on p. 234).
113. A. Sinha, M. O’Kelly, R. Tedrake, and J. C. Duchi, “Neural Bridge Sampling for
Evaluating Safety-Critical Autonomous Systems,” Advances in Neural Information
Processing Systems (NeurIPS), vol. 33, pp. 6402–6416, 2020 (cit. on pp. 126, 136).
114. D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, “Smoothgrad: Remov-
ing Noise by Adding Noise,” in International Conference on Machine Learning (ICML),
2017 (cit. on p. 234).
115. E. Soroka, M. J. Kochenderfer, and S. Lall, “Satisfiability.jl: Satisfiability Modulo
Theories in Julia,” Journal of Open Source Software, vol. 9, no. 100, p. 6757, 2024 (cit.
on p. 206).
116. C. A. Strong, H. Wu, A. Zeljic, et al., “Global Optimization of Objective Functions
Represented by ReLU Networks,” Machine Learning, vol. 112, pp. 3685–3712, 2023
(cit. on p. 197).
132. H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel, “Efficient Neural
Network Robustness Certification with General Activation Functions,” Advances in
Neural Information Processing Systems (NeurIPS), vol. 31, 2018 (cit. on p. 197).
133. Y.-D. Zhou, K.-T. Fang, and J.-H. Ning, “Mixture Discrepancy for Quasi-Random
Point Sets,” Journal of Complexity, vol. 29, no. 3-4, pp. 283–301, 2013 (cit. on p. 75).
134. G. M. Ziegler, Lectures on Polytopes. Springer Science & Business Media, 2012, vol. 152
(cit. on p. 155).
135. A. Zutshi, J. V. Deshmukh, S. Sankaranarayanan, and J. Kapinski, “Multiple Shoot-
ing, CEGAR-Based Falsification for Hybrid Systems,” in International Conference on
Embedded Software, 2014 (cit. on p. 67).