ROB501 Textbook2022 03 21
ROB501 Textbook2022 03 21
ROB501 Textbook2022 03 21
Mathematics
for Robotics
Jessy Grizzle
Director, Michigan Robotics
Cover design by Dan Newman, Head of Communications, Michigan Robotics
Composed as lecture notes and piloted between August 2014 and December 2018 for use in ROB 501, Mathematics for Robotics.
Preface 5
2 Some Highlights of Abstract Linear Algebra (or Practicing Proofs in a Safe Environment) 21
2.1 Fields and Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Linear Combinations and Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Basis Vectors and Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Representations of Vectors and the Change of Basis Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Linear Operators and Matrix Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Eigenvalues, Eigenvectors, and Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.8 A Few Additional Properties of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Abstract Inner Product Spaces for a Clear Vision of Deterministic Least Squares Problems 41
3.1 Preliminaries on Norms and Normed Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Inner Product Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Gram Schmidt Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Projection Theorem and the Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Relations between Symmetric and Orthogonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 Quadratic Forms, Positive Definite Matrices, and Schur Complements . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7 Least Squares Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3
5 Enough Probability and Estimation to Understand the Kalman Filter 85
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1.1 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1.2 Suggested Online Material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.1.3 (Optional Read) Probability Spaces Provide a Means to Formalize the Theory of Probability . . . . . . . . . 86
5.2 First Pass on Probability Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.1 Densities and Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.2 Random Vectors and Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3.1 Best Linear Unbiased Estimator (BLUE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3.2 Minimum Variance Estimator (MVE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4 Second Pass on Probability Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4.1 Marginal Densities, Independence, and Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4.2 Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4.3 (Optional Read) Derivation of the Conditional Density Formula from the Conditional Distribution: . . . . . . 100
5.5 Important Facts about Gaussian Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6 Conditioning with Gaussian Random Vectors: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.7 Discrete-time Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.7.1 Model and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.7.2 Basic Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.7.3 Preliminaries for the Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.7.4 Filter Derivation Using Induction and Properties of Conditional Distributions of Gaussian Random Vectors 107
5.7.5 Combined Update Version of the Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.7.6 (Optional Read) Extended Kalman Filter or EKF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.8 (Optional Read) Luenberger Observer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.9 (Optional Read) Information Matrix of Gaussian Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.10 (Optional Read) Deriving MVE as we did BLUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6 Enough Real Analysis to Understand the Existence of Limits of Sequences as well as Extrema of Functions 117
6.1 Open and Closed Sets in Normed Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2 Newton-Raphson Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3 Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4 Cauchy Sequences and Completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.5 Contraction Mapping Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.6 Continuous Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.7 Compact Sets and the Existence of Extrema of Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4
Preface
This collection of course notes is dedicated to all the students who have taken ROB 501 Mathematics for Robotics at the University
of Michigan. The first class of students in Fall 2014 took on one of two tasks: typesetting in Latex a lecture or creating a numerically
oriented HW problem in MATLAB. For their dedication, enthusiasm, and patience with the pilot version of the course, I tip my hat to
Pedro Fernando, Jimmy, Vittorio, Kevin, Yuxiao, Katelyn, Omar, Ross, Jakob, Xianan, Mohammad, Jeffrey, Bo, Kurt, Josh, Xiangyu,
Sulbin, Connie, Meghan, Su-Yang, Katie, Il Sun, Mia, Yong, Yunxiang, Hiroshi, Yevgeniy, Ming-Yuan, Pengcheng, and Zhiyuan. I
have taken your work and expanded it into these notes. Thank you!
Jessy Grizzle
Winter Term, 2022
5
6
Philosophy of the Course
The Robotics Graduate Program was designed in 2013 and launched in 2014. The vision was to take in students from all engi-
neering backgrounds, plus many STEM backgrounds, such as Mathematics and Physics, and prepare M.S. and Ph.D. students for
fruitful careers in Industry and Academia. We knew the incoming students would have very heterogeneous preparation in Math-
ematics, Programming, and Hardware. To form them into a more homogeneous cohort, two first-semester courses were created,
ROB 501 Mathematics for Robotics and ROB 550 Robotics Laboratory1 . After taking these two courses, the students would have
adequate preparation to take courses in Sensing, Acting and Reasoning, as laid out here https://robotics.umich.edu/
academic-program/courses/. The Robotics Graduate Program has worked remarkably well due to the enthusiasm of the
students, staff, aand faculty.
I accepted the charge of creating the mathematical fundamentals course, ROB 501. I interviewed the Robotics faculty in Spring
2014 to find out key topics they wanted covered. Over Summer 2014, I pared the list down and organized it into a coherent set of
topics that would mostly meet the needs as identified by the faculty. To the set of topics enumerated by my colleagues, I added one
additional goal, to break the students’ reliance on example-based (e.g., physics-based) reasoning, and instill in them the ability to
conduct abstract reasoning. The ability to abstract problems from a physical instantiation to a mathematical formulation provides
clarity of thinking as well as being the first step in making problems amenable to algorithmic solutions. Once a student has written
a problem down mathematically, they are able to search the literature for potential solutions, or understand that they have stumbled
upon a new question that requires new investigations.
To be clear, breaking a student’s reliance on physics-based reasoning really means augmenting their reasoning ability, it means giv-
ing them confidence to deal with abstract formulations of problems. The key word here is confidence. ROB 501 does this by doing
proof after proof in every lecture. By practice and repetition, the student learns to read (relatively) complex mathematical statements,
negate them, and otherwise appreciate common mathematical symbols as a precise and effective form of shorthand. To make the
repetition of definitions and proofs tenable, they first practice them in the friendly environment of linear algebra. We take advantage
of the fact that the vast majority of the students taking the course will be comfortable with matrix-vector arithmetic and MATLAB.
While Chapter 1 introduces mathematical notation and proof techniques, these topics are not really learned until they are exercised
in Chapter 2 with an abstract treatment of linear algebra, with everything proved, either in class or HW. Chapter 3 adds more abstrac-
tion, inner product spaces, but starts to provide a pay off, namely deterministic least squares problems. The students start to see how
concrete optimization problems can be solved easily and precisely with the normal equations, and through HW sets, that “practical”
problems involving functions can be solved with no essential changes to the methods. In addition, as a setup for the Kalman Filter,
we derive a recursive solution to the standard (batch) least squares problem, giving us RLS.
Chapter 4 is meant as a mental break. While students are completing HW sets and/or exams on the previous material, we use Gram
Schmidt to develop the QR Factorization, and we use orthogonal matrices and eigenvectors to derive the SVD and pause to illustrate
its practical importance. While typing up these notes, I added in the LU Factorization, in part because I used it as a building block in
the undergrad course, ROB 101, Computational Linear Algebra.
Chapter 5 sets the stage for minimum variance estimation, or non-deterministic least squares problems. We build slowly because if
there is one topic with which a typical first-semester engineering graduate student struggles, its probability. The students themselves
are not to blame for this, nor are their undergraduate instructors. The topic is fundamentally very challenging. As soon we leave the
confines of discrete random variables (with a finite number of possible outcomes) and enter the realm of continuous random variables,
we have a choice: go down the measure theory rabbit hole or fake it. To be sure, we fake it. To clear our conscience, we are upfront
about it and explain the restrictions we make to allow probabilities of events to be computed via an integral, that is, via a density
and Riemann integration. This allows random vectors to be defined and the important quantities, the mean of a random vector, its
covariance, and its variance. Armed with these concepts, we revisit a standard least squares problem and add a measurement noise
1 While ROB 550 was initially charged with preparing students in the areas of embedded programming, inverse kinematics, camera calibration, path planning,
encoders, motor drives, and more, later, in 2019, ROB 502 Programming for Robotics was created to relive some of the burden from ROB 550.
7
model, leading to the Best Linear Unbiased Estimator (BLUE), and then a measurement noise model and a stochastic model for the
unknown, x, leading to the Minimum Variance Estimator (MVE). From here, the goal is the Kalman filter. To get there, conditional
probability is needed, leading to a treatment of conditional normal random vectors, which, once distilled to a set of key facts, yields
the Kalman filter.
Chapter 6 is motivated by considering the tools required to understand the convergence of sequences and the existence of minima
and maxima of functions. It’s also a last pass through our proof techniques, because Chapters 4 and 5 were proof-light, being more
focused on important applications of the math we had already done. We introduce the basic notions of topology, open and closed sets,
in the context of normed spaces. Open and closed sets can be characterized with the notion of distance and through the convergence
of sequences, showing the interplay of things that may not seem so related at a student’s first glance. The contraction mapping prin-
ciple is covered as are standard results on compact sets and continuous functions on compact sets. This provides general conditions
that are easy to check in many engineering situations for the existence of extrema of continuous real-valued functions. These notes
conclude with brief remarks on convex functions, QPs, and LPS for minimizing the one-norm and the max-norm.
8
Chapter 1
Learning Objectives
• Establish notation
• Cover basic proof techniques, some of which may be a review
Outcomes
• Learn how to read mathematical statements such as “∀ ϵ > 0, ∃ δ > 0 such that ∀ x ∈ Bδ (x0 ), ||f (x) − f (x0 )|| < ϵ. ” This is
the definition of a function being continuous at a point x0 , by the way.
• Learn how to negate mathematical statements such as the above.
• Review or learn methods of proofs that we will use on a daily basis in the course.
• In particular, overcome reluctance to use “proof by contradiction”.
9
1.1 Mathematical Notation
N = {1, 2, 3, · · · } Natural numbers or counting numbers
∃ means “for some", “there exist(s)”, “there is/are”, “for at least one”
∼ denotes “logical not”. You will also often see ¬. We’ll use both in these notes.
p =⇒ q means “if the logical or mathematical statement p is true, then the statement q is true”
p ⇐⇒ q means “p is true if, and only if , q is true”. While p iff q is another way to write p ⇐⇒ q, we will mostly avoid
using it in these notes.
p ⇐⇒ q is logically equivalent to
(a) p =⇒ q and
(b) q =⇒ p
The contrapositive of p =⇒ q is ∼ q =⇒ ∼ p
The converse of p =⇒ q is q =⇒ p. It is very important to note that in general, (p =⇒ q) DOES NOT IMPLY (q =⇒ p),
and vice-versa. If they did, we would not need p ⇐⇒ q.
Relation: (p =⇒ q) ⇔ (∼ q =⇒ ∼ p). The two statements are logically equivalent, and hence in principle, we do not need
both of them. We will see, however, that sometimes one of them is easier to use in a proof than the other.
Logical and: p1 ∧ p2 is read p1 and p2 . It is true when both p1 and p2 are true.
Logical or: p1 ∨ p2 is read p1 or p2 . It is true when at least one of p1 and p2 is true. We do not use “exclusive or” in this course.
Hence, if T stands for true and F for false, then T ∨ T = T ∨ F = F ∨ T = T.
Q.E.D. or QED is an abbreviation of the Latin “quod erat demonstrandum” which means “thus it was demonstrated”; it is used
to alert the reader that a proof has been completed. Nowadays, you more frequently see □ or ■ instead of QED.
Warning: In the beginning, it is quite frequent that students confuse the meanings of contrapositive and converse. Just be careful.
With practice, it becomes second nature. The fact that they both start with “con-” is not helpful!
1.2 Vocabulary
The following definitions borrow liberally from Math 299, Michigan State University, https://users.math.msu.edu/
users/duncan42/AxiomNotes.pdf
Meanings:
10
• Lemma: A true statement used in proving other true statements (that is, a less important theorem that is helpful in the proof of
other results).
• Claim: A true statement that is sometimes made as a step toward proving a theorem, in which case it is similar to a lemma.
It is also sometimes used to highlight a property of a mathematical object that a reader might miss. For example, one might
define a matrix A to be invertible if there exists a second matrix B such that A · B = B · A = I, where I is the identity matrix.
Nowhere in this definition have we explicitly stated that A must be square. Hence, you might then Claim: An invertible matrix
is square. And then prove the claim by using the rules (i.e., definition) of matrix multiplication.
• Corollary: A true statement that is a “simple” deduction from a theorem or proposition.
• Proof : The explanation of why a statement is true.
• Conjecture: A statement believed to be true, but for which we have no proof. (A statement that is being proposed to be a true
statement).
• Axiom: A basic assumption about a mathematical situation, or said another way, a statement Mathematicians assume to be
true. One example comes from Euclid “ two distinct parallel lines in the plane never intersect”. This is a fundamental tenet of
Euclidean geometry and is false in geometries where lines are curved. Another example is the axiom that for any two integers
a and b, the symbol “+” called sum has the property a + b = b + a, that is, the sum of a and b is equal to the sum of b and
a. In this sense, axioms and definitions are very similar. Axioms are reserved for the “bedrock” definitions in a logical or
mathematical system, the definitions on which everything else is based.
Definition 1.1 An integer n is even if n = 2k for some integer k; n is odd if n = 2k + 1 for some integer k.
Remark 1.2 In a definition, the convention is that “if” means “if, and only if”. It would be considered bad style (form) to write the
above definition as follows: An integer n is even if, and only if n = 2k for some integer k; n is odd if, and only if n = 2k + 1 for
some integer k. What you have to understand is that the two meanings are identical when used in a definition. In a lemma, theorem,
claim, proposition, corollary, etc., the two meanings are very different. Yes, this takes time before it becomes second nature.
Example 1.3 Provide a direct proof that the sum of two odd integers is even.
Proof: Let n1 and n2 be odd integers. Then by the definition of odd, there exist integers k1 and k2 such that
n1 = 2k1 + 1
n2 = 2k2 + 1.
Because k1 + k2 + 1 is the sum of three integers, it is also an integer, and therefore 2(k1 + k2 + 1) is by definition, an even integer.
Because n1 + n2 = 2(k1 + k2 + 1), it is even. ■
Did we really need to say the last line in the proof, namely, “Because n1 + n2 = 2(k1 + k2 + 1), it is even”? It’s a matter of taste. A
proof is supposed to convince a skeptical reader that something is true. Sometimes, when writing a paper, you do have to restate the
obvious to ensure that a reviewer does not miss it.
11
1.3.2 Proof by Contrapositive:
A proof by contrapositive means that to establish p =⇒ q, we prove its logical equivalent, ∼ q =⇒ ∼ p. Once you have written
down what is ∼ q and what is ∼ p, the proof often proceeds like a direct proof.
Proof: In the beginning of your proof writing career, it is highly recommended to explicitly write down what is p and what is q. This
helps you to understand what you are trying to show and what is/are the given hypothesis(eses).
Our proof of p =⇒ q is to show ∼ q =⇒ ∼ p, that is, if n is odd, then n2 is odd. Hence, let n be an odd integer. By the definition
of n being odd, there exists an integer k such that n = 2k + 1. Therefore,
On occasion, we will do a proof by exhaustion. For us, four cases will already be a lot!
First Principle of Induction (Standard Induction): Let P (n) denote a statement about the natural numbers with the following
properties:
Remark 1.5 Suppose the base case involves an integer k0 ̸= 1. Then you can re-index things and reduce it to the base case having
k0 = 1. Alternatively, you assume that P (k) true for k ≥ k0 implies P (k + 1) is true, and then you get P (n) is true for all n ≥ k0 .
A common mistake is to not use the correct base case. For an example, you should read about how “to prove by induction” that all
horses are the same color https://en.wikipedia.org/wiki/All_horses_are_the_same_color.
Proof:
12
• Step 1: Check the base case, P (1): For k = 1, we have that 1 = 12 , and hence the base case is true.
• Step 2: Show the induction hypothesis is true. That is, using the fact that P (k) is true, show that P (k + 1) is true. Often, this
involves re-writing P (k + 1) as a sum of terms that show up in P (k) and another term.
For us,
P (k + 1) : 1 + 3 + 5 + · · · + (2k − 1) + (2(k + 1) − 1) = (k + 1)2 .
For the induction step, we assume that
P (k) := 1 + 3 + 5 + · · · + (2k − 1) = k 2
is true and thus P (k + 1) is true if, and only if
k 2 + (2(k + 1) − 1) = (k + 1)2 .
k 2 + (2(k + 1) − 1) = k 2 + 2k + 2 − 1 = k 2 + 2k + 1 = (k + 1)2 ,
and hence P (k + 1) is true. Because we have shown that P (1) is true and for all k ≥ 1, P (k) =⇒ P (k + 1), by the Principle of
Induction, we conclude that for all k ≥ 1,
1 + 3 + 5 + · · · + (2k − 1) = k 2 .
■
Remark 1.7 When the base case P (1) seems so totally trivial that you are unsure whether there is anything to show, it’s OK to check
P (2), just to convince yourself that it is true too. In our case you would check P (2) : 1 + 3 = 22 ; since this is easily established
to be true, you may now be more confident of your proof. If you do one more, P (3) : 1 + 3 + 5 = 33 , you are now on a roll and
ready to attack the general case by induction. Bottom line: In the beginning, it’s natural to be tentative when you do a proof. It takes
practice to learn the art of proving things. When you write a proof, we will not take off points for doing a bit more work than is
strictly required. We understand that you are slowly building up your confidence.
Warning: When you seek to establish the induction hypothesis, that P (k + 1) =⇒ P (k), you need to make sure that your rea-
soning works for all k, including the base case, k = 1. In the infamous “all horse are the same color proof (spoof)”, the statement
P (1) =⇒ P (2) fails. People overlook it by starting at P (2) =⇒ P (3) or they make a mistake on P (1) =⇒ P (2). Oooops!
Second Principle of Induction (Strong Induction): Let P (n) be a statement about the natural numbers with the following proper-
ties:
(a) Base Case: P (1) is true.
(b) Induction hypothesis: If P (j) is true for all 1 ≤ j ≤ k, then P (k + 1) is true.
Then P (n) is true for all n ≥ 1 (or, n greater than or equal to the n0 used in the Base Case).
Remark 1.8 You can see why it is sometimes called Strong Induction: we have access to all of the logical statements P (j) up to,
and including, P (k), when we are trying to prove the induction step P (k + 1) is true. We will show a bit later that the two principles
of induction are logically equivalent. Nevertheless, sometimes one method is easier to apply than the other.
Definition 1.9 A natural number n ≥ 2 is composite if it can be factored as n = a · b, where a and b are natural numbers satisfying
1 < a, b < n. Otherwise, n is prime.
Remark 1.10 It follows from the above definition that a natural number greater than or equal to two is either prime or composite,
and it cannot be both. The first prime number is 2. What is the number 1?
Example 1.11 Let’s prove the Theorem: (Fundamental Theorem of Arithmetic) Every natural number n ≥ 2 can be factored as a
product of one or more primes.
Proof:
13
• Step 0: We write down the statements. For k ≥ 2, P (k): there exist ik ≥ 1 and prime numbers p1 , p2 , . . . , pik such that the
product p1 · · · pik = k.
• Step 1: Check the base case, P (2): For k = 2, we have that 2 = 2, and hence the base case is true.
• Step 2: Show the induction hypothesis is true. That is, using the fact that P (j) is true for 1 ≤ j ≤ k, show that P (k + 1) is
true, that is, k + 1 can be expressed as a product of primes. There are two cases:
(a) Case 1: k + 1 is prime. In this case, we are done because k + 1 is already the product of one prime, namely itself.
(b) Case 2: k + 1 is composite. Then, there exist two natural numbers a and b, 2 ≤ a, b ≤ k, such that k + 1 = a · b.
Because a and b are natural numbers that are greater than or equal to 2 and less than or equal to k, by the induction step:
■
Strong Induction was useful here because we needed to “reach back” and use our statements P (j) for values of j not equal to k. If
we relied only on Ordinary Induction, we would have been stuck. This raises the question again, is Strong Induction really more
powerful than Ordinary Induction? The answer is NO, it is just sometimes more convenient.
Equivalence of Strong and Ordinary Induction: Let P (k) be the set of logical statements that are used with Strong Induction.
Then the induction step is equivalent to
because we assume that P (j) is true for 1 ≤ j ≤ k. Next, you can note that (1.1) is equivalent to
If we return to our proof of the Fundamental Theorem of Arithmetic, then what is Q(k)? Well, Q(k) true means that P (1), P (2), . . . , P (k)
are all true, and hence
Q(k) : for all integers 1 ≤ j ≤ k, there exist primes p1 , p2 , . . . , pij , such that j = p1 · p2 · · · pij .
The proof of the induction step Q(k) =⇒ Q(k + 1) is then identical to the proof we gave for P (1) ∧ · · · ∧ P (k) =⇒ P (k + 1).
Initially, students tend to dislike proofs by contradiction. It seems to be a contradiction to logical thinking that such a method of
proof can even be correct! A statement that is both true and false is called a contradiction. The basis for a proof by contradiction is
that if you start with only true statements and correctly apply the rules of logic, you cannot generate a contradiction. Hence, if you
start with a statement, say p, and through valid application of the rules of logic, arrive at a second statement, say R that is both T and
F, then the statement p must be false. We can write this as
p =⇒ (∃ R such that R ∧ (∼ R) = T) ⇐⇒ p = F.
14
Only false statements can generate contradictions.
The game plan for a proof by contradiction: We want to show that a statement p is true. We assume instead that the statement is
false. On the basis of p being false, we derive a “contradiction”, meaning some statement R that we show to be both true and false.
We conclude that ∼ p is false, because it led to a contradiction. Hence, p must be true.
Because both m and n are even, they have 2 as a common factor, which is a contradiction to m and n have no common factors.
√ √
Because we arrived at this contradiction from the statement “ 2 is rational”, we deduce that “ 2 is rational” must be false. Hence,
√
2 is irrational. ■.
• We now have ∼ R = T.
• Hence, (R ∧ (∼ R)) = T, which is a contradiction.
• Conclusion: ∼ p = T is impossible, and therefore ∼ p = F .
√
• Hence, p = T and we have proved that 2 is irrational. Pretty cool!
Remark 1.14 You can also use proof by contradiction to prove p =⇒ q. What you do is start with p ∧ (∼ q), that is, you assume p
is true and ∼ q is true. This gives you an easy way of generating a statement that you believe to be false, and hence should lead to a
contradiction.
15
1.3.6 Summary:
In conclusion, we have the following proof techniques.
• Direct Proof: p =⇒ q
• Proof by Contrapositive: ∼ q =⇒ ∼ p. (Start with the conclusion being false, that is ∼ q and do logical steps to arrive at ∼ p)
• Proof by Contradiction Version I: To show that p is a true statement, you assume instead that ∼ p is true and seek a statement
R such that both R and ∼ R are true, which is a contradiction. Deduce that ∼ p = F and hence p = T.
• Proof by Contradiction Version II: Start with p ∧ (∼ q) (assume p is true and q is false. Find a statement R such that both R
and ∼ R are true, which is a contradiction. Deduce that ∼ (p ∧ (∼ q)) = T, that is, (p ∧ (∼ q)) = F, and hence, if p = T then
q = T. That is, p =⇒ q.
• Proof by Exhaustion, where we enumerate a finite set of cases and check them one by one.
A rookie mistake is to prove “both” p =⇒ q and its contrapositive , ∼ q =⇒ ∼ p. The problem here is that ∼ q =⇒ ∼ p
is logically equivalent to p =⇒ q and hence all you have done is prove the same thing two different ways, which is not what
you wanted to do!
p ¬p
T F
F T
Here is a truth table for p ∧ q, which has two inputs, p and q, and one output p ∧ q.
p q p∧q
T T T
T F F
F T F
F F F
Here is a truth table for p =⇒ q using proof by contradiction. The table has two inputs p and q, and one output ∼ (p ∧ (∼ q)), the
last column. We include several intermediate columns required to compute the output. The input and output are highlighted in blue.
p q ¬q p ∧ ¬q ¬(p ∧ ¬q)
T T F F T
T F T T F
F T F F T
F F T F T
16
It is remarkably easy to check that the above table is correct because it only involves logical negation and logical and. The hard part
is accepting that (p =⇒ q) ⇐⇒ ¬(p ∧ ¬q).
Now we give you a partially filled truth table for p =⇒ q, which has two inputs, p and q, and one output p =⇒ q and ask you to
complete it.
p q p =⇒ q
T T T
T F F
F T ?
F F ?
Remark 1.15 Here is one way to think about it that I once found on the web and have lost the link: We define p : (you score 100%
on all exams and assignments) and q : (I assign you an A+ grade for the course). We all agree that p =⇒ q must be a correct
statement. In fact, it should be so deep into the bedrock of grading that we can call it an axiom! It’s just that until now, you maybe
never thought to complete a truth table for it! So, consider the following four scenarios, which correspond to the four lines in the
truth table:
p = you score 85% and q = your grade is an A+ , then (p =⇒ q) = ? (ask yourself, does this invalidate the statement?)
p = you score 85% and q = your grade is an A− , then (p =⇒ q) = ? (ask yourself, does this invalidate the statement?)
To get the answers, look at the truth table of ¬(p ∧ ¬q). If this makes your head spin, don’t worry about it. After we do a bunch
of proofs, you can come back to this riddle. The reasoning is, once you fail to live up to the 100% standard, then p = F. Then, no
matter what grade I give you, I am not invalidating the promise, p =⇒ q. Hence, the questions marks cannot be F. Because we are
using binary logic, ¬F = T. Therefore, the question marks must be replaced with T. Pretty cool, right?
Solution:
17
Example 1.17 Let y ∈ R and define p : ∀ δ > 0, ∃x ∈ Q such that |x − y| < δ. Determine its negation. Recall, Q is the set of
rational numbers.
Solution:
• Math form: p : ∀ δ > 0, ∃x ∈ Q such that |x − y| < δ
• Word form: p : for all δ > 0, there exists x ∈ Q such that |x − y| < δ
• Negate: ∼ p : not (for all δ > 0, there exists x ∈ Q such that |x − y| < δ)
• Equivalent: ∼ p : for some δ > 0, not[there exists x ∈ Q such that |x − y| < δ]
• Equivalent: ∼ p : for some δ > 0, there does not exist x ∈ Q such that |x − y| < δ
• Equivalent: ∼ p : there exists δ > 0, such that for all x ∈ Q, not[|x − y| < δ]
• Equivalent: ∼ p : there exists δ > 0, such that for all x ∈ Q, |x − y| ≥ δ
• Math form: ∼ p : ∃ δ > 0, ∀x ∈ Q, |x − y| ≥ δ
■
When negating statements with ∃ and ∀, here are the logical equivalents of their negations:
∼ ∀ ⇐⇒ ∃
∼ ∃ ⇐⇒ ∀
Remark 1.18 It is better to avoid ̸ ∀ and ̸ ∃, but they are legal symbols.
Definition 1.20 An element b ∈ A is a maximum of A if x ≤ b for all x ∈ A. We note that in the definition, b must be an element of
A. We denote it by
max A or max{A}.
It is very important to note that a max of a set may not exist! Indeed, the set A = {x ∈ R | 0 < x < 1} does not have a maxi-
mum element. We will see later that it does not have a minimum either. This is what motivates the notions of supremum and infimum.
Definition 1.21 An element b ∈ R is an upper bound of A if x ≤ b for all x ∈ A. We say that A is bounded from above.
Remark 1.22 We note that in the definition of upper bound, b does NOT have to be an element of A.
(b) if b ∈ R satisfies x ≤ b for all x ∈ A, then b∗ ≤ b. (This means that there is no other upper bound that is strictly smaller than
b∗ .)
Notation and Vocabulary 1.24 The least least upper bound of A is also called the supremum of A and is denoted
sup A or sup{A}
18
Key Theorem If A ⊂ R is bounded from above, then sup{A} exists in R.
Remark 1.25 The rational numbers Q do not satisfy the above property, namely, sets that are bounded from above may not have a
supremum within the rational numbers. Let A ⊂ Q be defined as
1 √
A := {x ∈ Q | ∃ n ≥ 1, x + < 2}.
n
The number 1.5 ∈ Q is an upper bound of A, but there is no rational
√ number that is a least upper bounded. Of course, viewed as
a subset of the real numbers, A has a least upper bound, namely 2. This is a subtle but important difference. In fact, one can
construct the real numbers from the rational numbers as the “smallest set that (i) contains the rationals and (ii) all subsets that are
bounded from above have a least upper bound, that is, a supremum.
Example 1.26
Remark 1.27 (Dejà Remark All Over Again) The existence of a supremum is a special property of the real numbers. If you are
working within the rational numbers, an upper bounded set may not have a rational supremum. If you view the set as a subset of the
reals, it will then have a supremum, but the supremum may be an irrational number.
Example 1.28
• A = {x ∈ Q | 0 < x < 1}. Then sup A = 1. Indeed, 1.0 is a rational number, it is an upper bound, and it less than or equal
to any other upper bound; hence it is the supremum.
• A = {x ∈ Q | x2 ≤ 2}. Then (1.42)2 = 2.0164, and thus b = 1.42 is a rational upper bound. Also (1.415)2 = 2.002225, and
thus b = 1.1415 is a smaller rational upper bound. However, there is no least upper bound within the set of rational numbers.
When we view √ the set A as being a subset of the real numbers, then there is a real number that is a least upper bound and we
have sup A = 2. This is what we mean when we say that the existence of a supremum is a special or distinguishing property
of the real numbers.
Everything we have done above can be repeated with minimum replacing maximum, greatest lower bound replacing least upper
bound, and infimum replacing supremum. Consider once again a set A ⊂ R.
Definition 1.29 An element b ∈ A is a minimum of A if b ≤ x for all x ∈ A. We note that in the definition, b must be an element of
A. We denote it by min A or min{A}.
It is important to note that a min of a set may not exist! Indeed, the set A = {x ∈ R | 0 < x < 1} does not have a minimum element.
Definition 1.30 An element b ∈ R is a lower bound of A if b ≤ x for all x ∈ A. We say that A is bounded from below.
Remark 1.31 We note that in the definition of lower bound, b does NOT have to be an element of A.
19
Definition 1.32 An element b∗ ∈ R is the greatest lower bound of A if
Notation and Vocabulary 1.33 The greatest lower bound of A is also called the infimum and is denoted
inf A or inf{A}
Key Theorem If the set A is bounded from below, then inf A exists.
Example 1.34
Remark 1.35 If the infimum is in the set A, then it is equal to the minimum.
Definition 1.36 If a set A ⊂ R is not bounded from above, we define sup A = +∞. If A ⊂ R is not bounded from below, we define
inf A = −∞. The symbols +∞ and −∞ are not real numbers. The extended real numbers are sometimes defined as
Re := {−∞} ∪ R ∪ {+∞}.
Remark 1.37 (Final) Michigan’s MATH 451 constructs the real numbers from the rational numbers. This is a very cool process to
learn. Unfortunately, we do not have the time to go through this material in ROB 501. The existence of supremums and infimums for
bounded subsets of the real numbers is a consequence of how the real numbers are defined (i.e., constructed).
20
Chapter 2
Learning Objectives
• Establish key definitions for fields, vectors, vector spaces, linear combinations, linear independence, subspaces, basis vectors,
linear operators, matrix representations, eigenvalues and eigenvectors.
• Initial practice with the proof techniques of the previous chapter.
Outcomes
• Understand a very general notion of vector spaces, where the vectors range from the standard column vectors in Rn to matrices
and functions.
• Understand how the notion of representing a vector with respect to a set of basis vectors establishes one-to-one relations with
columns of numbers that can be used in numerical computations.
21
2.1 Fields and Vector Spaces
Mostly in this course, we work with the real numbers or the complex numbers. However, in this Chapter, as part of our push to tear
you away from the familiar and confront you with more abstraction than you may have seen before, we define a general field, F.
When you are first learning the abstract definition, it is good to keep a “ canonical example” in mind, and that would be F = R.
Definition 2.1 (Chen, 2nd edition, page 8) : A field consists of a set, denoted by F, of elements called scalars and two operations
called addition “+” and multiplication “·”; the two operations are defined over F such that they satisfy the following conditions:
1. To every pair of elements α and β in F, there correspond an element α + β in F called the sum of α and β, and an element
α · β (or simply αβ) in F called the product of α and β.
(α + β) + γ = α + (β + γ) (α · β) · γ = α · (β · γ)
α · (β + γ) = (α · β) + (α · γ)
5. F contains an element, denoted by 0, and an element, denoted by 1, such that α + 0 = α and 1 · α = α for every α in F.
6. To every α in F, there is an element β in F such that α + β = 0. The element β is called the additive inverse.
7. To every α in F which is not the element 0, there is an element γ in F such that α · γ = 1. The element γ is called the
multiplicative inverse.
To show a given set is a field, you must check all seven axioms. To show something is not a field, you only need to show that one of
the axioms fails.
Definition 2.2 (Chen 2nd Edition, page 9) A vector space (or, linear space) over a field F, denoted by (X , F), consists of a set,
denoted by X , of elements called vectors, a field F, and two operations called vector addition and scalar multiplication. The two
operations are defined over X and F such that they satisfy all the following conditions:
1. To every pair of vectors v 1 and v 2 in X , there corresponds a vector v 1 + v 2 in X , called the sum of v 1 and v 2 1 .
4. X contains a vector, denoted by 0, such that 0+v = v for every v in X . The vector 0 is called the zero vector or the origin.
6. To every α in F, and every v in X , there corresponds a vector α · v in X called the scalar product of α and v.
22
8. Scalar multiplication is distributive with respect to vector addition: For any α in F and any v 1 , v 2 in X , α · v 1 + v 2 =
α · v1 + α · v2 .
9. Scalar multiplication is distributive with respect to scalar addition: For any α, β in F and any v in X , (α + β)·v = α·v +β ·v.
1. Every field forms a vector space over itself. (F, F). Examples: (R, R), (C, C), (Q, Q).
2. (C, R), meaning X = C, F = R, is a vector space. This works because a real scalar α ∈ F = R times a complex number
v ∈ X = C yields a complex number α · v ∈ X = C.
Note that we are using the known properties of addition and multiplication of real numbers to define what we mean by the sum
of two functions and the product of a real number and a function. To prove that (X , F) is a vector space, you must check all
ten of the axioms. We’ll check #8 to illustrate what as to be done: we need to show that scalar multiplication is distributive
with respect to vector addition:
α · (f + g) = α · f + α · g.
Our method of proof is to show that the left hand side (LHS) equals the right hand side (RHS) when
Let t ∈ D, then
5. In a similar manner, one can take X := F n×m = {n × m matrices with coefficients in F} with vector addition and scalar
times vector multiplication defined in the obvious ways. Hence, matrices can be viewed as vectors.
6. (R, Q) is a vector space. It is not particular useful in Robotics. It’s kind of cool to think about. The set of vectors is R. Clearly,
the sum any two real numbers is again a real number, and hence a vector. The usual scalar times vector multiplication works
because the product of a rational number and a real number is a real number, and hence is a vector.
23
Non-Example 2.4
1. Take X = R, F = C. Then (X , F) := (R, C) fails the definition of scalar multiplication (and others).
3. (Q, R) is a not vector space. The set of vectors is Q. Clearly, the sum any two rational numbers is again a rational number,
and hence a vector. The usual scalar times vector multiplication fails, however, because the product of a real number and a
rational number is a real number, and hence is not a vector.
2.2 Subspaces
Notation 2.5 Let A and B be sets. Then
• (A ⊂ B) ⇐⇒ (a ∈ A =⇒ a ∈ B)
• (A = B) ⇐⇒ (A ⊂ B and B ⊂ A)
In ROB 501, we do not use the notion of A being a strict subset of B, which in some books is denoted as A ⊊ B or A ⫋ B. Please
do not use such notation in your HWs or exams.
Definition 2.6 Let (X , F) be a vector space, and let Y be a subset of X . Then Y is a subspace if using the rules of vector addition
and scalar multiplication defined in (X , F), we have that (Y, F) is a vector space.
Remark 2.7 To show that a set is a subspace using the definition, you have to check axioms 1 to 10. To show that a set is NOT a
subspace, you just need to show that one of the axioms is violated. The first thing you should always check is that 0 ∈ Y.
Proposition 2.8 (Tools to check that something is a subspace) Let (X , F) be a vector space and Y ⊂ X . Then, the following are
equivalent (TFAE):
Because (a) through (d) are equivalent, you can use any of (b) through (d) to show that (a) is true.
Example 2.9
2 β
• (X , F) := (R , R) and Y := β ∈ R ⊂ X . For v 1 , v 2 ∈ Y, we have
2β
β1 β2 β1 + β2
+ = ∈Y
2β1 2β2 2(β1 + β2 )
| {z } | {z }
v1 v2
24
• (X , F) , F = R, where X = {f : R → R}, and we define
is a subspace because we know that the sum of two polynomials with real coefficients is another polynomial with real co-
efficients, and the product of a real number and a polynomial with real coefficients is once again a polynomial with real
coefficients.
d
• In a similar manner, you can check that Ye := f : R → R | f is differentiable , dt f ≡ 0 is a subspace of X .
2 β 1
Non-Example 2.10 We take (X , F) := (R , R) and define Y := + β ∈ R ⊂ X . Then 0 ̸∈ Y and hence Y is
2β 0
not a subspace. In a similar manner, Yb := {f : R → R | f (2) = 1.0} is not a subspace because it does not contain the zero vector,
that is, a function that is zero for all t ∈ R.
Definition 2.12 A finite set of vectors {v 1 , . . . , v k } is linearly dependent if ∃α1 , . . . αk ∈ F not all zero such that α1 v 1 + α2 v 2 +
. . . + αk v k = 0. Otherwise, the set is linearly independent.
Remark 2.13 Suppose {v 1 , . . . , v k } is a linearly dependent set. Then, ∃α1 , . . . , αk are not all zero such that
α1 v 1 + α2 v 2 + . . . + αk v k = 0.
Suppose αk ̸= 0. Then,
and therefore v k is a linear combination of the set {v 1 , . . . , v k−1 }. Using this observation, you can prove the following.
The equivalence of (a) and (c) motivates the following definition for linear independence of sets that may have an infinite number of
elements.
Definition 2.15 An arbitrary set of vectors S ⊂ X is linearly independent if every finite subset is linearly independent.
Example 2.16 Let F = R and X = P(t) = { set of polynomials with real coefficients }. Prove the following Claim: The monomials
are linearly independent. In particular, for each n ≥ 0, the set {1, t, . . . , tn } is linearly independent.
Direct Proof: Suppose that p(t) := α0 +α1 t+· · ·+αn tn = 0 is the zero polynomial. We need to show that α0 = α1 = · · · = αn = 0.
k
From Calculus, we know that a polynomial of degree n is identically zero if, and only, if, p(0) = 0 and d dtp(t) k |t=0 = 0 for
25
k = 1, 2, . . . , n. Armed with this notion, we check that
0 = p(0) =⇒ α0 = 0
dp(t)
0= |t=0 = (α1 + 2α2 t + · · · + nαn tn−1 )|t=0 =⇒ α1 = 0
dt
d2 p(t)
0= |t=0 = (2α2 + 6α3 t + · · · + n(n − 1)αn tn−2 )|t=0 =⇒ α2 = 0
dt2
..
.
dn p(t)
0= |t=0 = n!αn =⇒ αn = 0.
dtn
Proof by Induction: Let k ≥ 0, and define the property P (k) by
P (k) : {1, t, . . . , tk } is linearly independent.
Base Case: P (0) is true; that is, the set {1} is linearly independent. You can check this on your own.
Induction Step: For k ≥ 0, we assume that P (k) is true and we must show that P (k + 1) is true.
By a previous remark, it is enough to show that tk+1 cannot be written as a linear combination of {1, t, . . . , tk }. Suppose to the
contrary that there exist α0 , α1 , . . . , αk such that
tk+1 = α0 + α1 t + · · · + αk tk .
Appealing to basic Calculus, we differentiate both sides k + 1 times and arrive at
(k + 1)! = 0,
which is clearly false. Hence tk+1 cannot be written as a linear combination of {1, t, . . . , tk } and therefore P (k + 1) is true.
For {v 1 , v 2 , v 4 }, we form a linear combination and seek coefficients resulting in the zero vector,
α1 0 0 α 0 0 0 0 0 α1 + α2 0 0 0 0 0
α1 v 1 + α2 v 2 + α4 v 4 = 02×3 ⇐⇒ + 2 + = = .
2α1 0 0 0 0 0 α4 0 0 2α1 + α4 0 0 0 0 0
One then checks that α1 = 1, α2 = −1, α4 = −2 is a nontrivial solution. There are infinite number of other nontrivial solutions. We
only needed to find one to show that {v 1 , v 2 , v 4 } is a linearly dependent set of vectors.
■
Example 2.18 Let F = R and X = R2×3 , the set of 2 × 3 matrices with real coefficients. Are the vectors
1 0 4 4 1 0
A1 = , A2 =
3 −1 2 6 0 6
linearly independent?
26
Solution: We form a linear combination of A1 and A2 and check for a nontrivial solution.
α1 + 4α2 =0
α2 =0
α1 + 4α2 α2 4α1 0 0 0 4α1 =0
α1 A1 + α2 A2 = = ⇐⇒
3α1 + 6α2 −α1 2α1 + 6α2 0 0 0
3α1 + 6α2 =0
−α1 =0
2α1 + 6α2 =0
We wrote out all of the equations to emphasize that when you set a 2 × 3 matrix equal to the zero matrix, each of its entries must be
zero. We only have two unknowns, so we could have gotten by with just noting that the second and fifth equations together imply
that α1 = α2 = 0. We conclude that the set {A1 , A2 } is linearly independent. ■
Remark 2.19 The √ field F is important when determining whether a set is linearly independent or not. For example, let X = C and
v 1 = 1, v 2 = j := −1. v 1 and v 2 are linearly independent when F = R. However, they are linearly dependent when F = C.
Definition 2.20 Let S be a subset of a vector space (X , F). The span of S, denoted span{S}, is the set of all linear combinations of
elements of S. That is,
span{S} := {x ∈ X | ∃n ≥ 1, α1 , . . . , αn ∈ F, v 1 , . . . , v n ∈ S, s.t. x = α1 v 1 + · · · + αn v n }.
Remark 2.21 By construction, span{S} is closed under linear combinations. Hence, it is a subspace of X .
Example 2.22 Let F = R and X = {f : R → R}.
(a) What is the span of S = {1, t, t2 , . . . }?
(b) Is et ∈ span{S}?
Solution:
(a) span{S} = span{tk , k ≥ 0} = P(t), the set of polynomials in t with real coefficients.
(b) et ̸∈ span{S} because et is not a polynomial. Even though et has a Taylor series expansion, the number of components in
that sum is infinite, while, by definition, a linear combination has to be finite. Is it possible to write et in a different way as a
d t
polynomial? The answer is no. From Calculus, we know that dt e = et . Moreover, because differentiation of a polynomial
d
reduces its degree by one, there is no non-zero polynomial p(t) that satisfies dt p(t) = p(t).
■
27
(b) {je1 , je2 , . . . , jen } is also a basis for (Cn , C).
(c) {v 1 , v 2 , . . . , v n } is also a basis for (F n , F), where
1 1 1
0 1 1
v1 = . , v2 = . , . . . , vn = . .
.. .. ..
0 0 1
(a) {e1 , e2 , . . . , en } is NOT a basis for (Cn , R) because span{e1 , e2 , . . . , en } ̸= Cn . Indeed, je1 ̸∈ span{e1 , e2 , . . . , en } when
the field is the real numbers!
(b) The set {e1 , e2 , . . . , en , je1 , je2 , . . . , jen } is not linearly independent in (Cn , C) and is therefore not a basis of (Cn , C).
Definition 2.27 (X , F) is infinite dimensional if for every n > 0, there is a linearly independent set with n or more elements in it.
Remark 2.28 Because subspaces are vectors spaces in their own right, the above definitions apply to assigning their dimensions.
By convention, the subspace consisting of the zero vector, {0}, has dimension zero.
(a) dim(F n , F) = n.
(b) dim(Cn , R) = 2n.
(c) dim(P(t), R) = ∞.
(d) dim(R, Q) = ∞.
Remark 2.30 It is common to define the dimension of a vector space (or subspace) as the cardinality of a basis. We provided
bases for examples (a), (b), and (c) in Example 2.24. But what about (d)? For (d), an explicit basis cannot be written down.
You can find proofs online, based on other types of arguments, that show dim(R, Q) = ∞, such as http://www2.math.ou.
edu/~aroche/courses/LinAlg-Fall2011/solutions1.pdf. The vector space (R, Q) does not have any particular
importance in Robotics. It’s just cool every once in a while to think about non-intuitive ideas, such as creating an infinite dimensional
vector space where the vectors belong to a set that your intuition is screaming “these must be one dimensional objects”!
Theorem 2.31 Let (X , F) be an n-dimensional vector space (always means n is finite). Then, any set of n linearly independent
vectors is a basis.
Proof: Let {v 1 , . . . , v n } be a linearly independent set in (X , F). To show it is a basis, we need to show the “span” property, namely,
∀x ∈ X , ∃ α1 , . . . , αn ∈ F such that x = α1 v 1 + · · · + αn v n .
Let x ∈ X be given. Because (X , F) is n-dimensional, the set {x, v 1 , . . . , v n } is a linearly dependent set, because, otherwise,
dim(X ) > n. Hence, ∃ β0 , β1 , . . . , βn ∈ F, NOT ALL ZERO, such that β0 x + β1 v 1 + · · · + βn v n = 0.
Claim 2.32 β0 ̸= 0.
28
Proof : We do a proof by contradiction. Suppose that β0 = 0. Then,
(a) and (b) imply that {v 1 , . . . , v n } is a linearly dependent set. However, by assumption, {v 1 , . . . , v n } is a basis and hence linearly
independent. This is a contradiction. Hence, β0 = 0 cannot hold. □
β0 x = −β1 x1 − . . . − βn v n
⇓
−β1 −βn
x= v1 + . . . + vn
β0 β0
−β1 −βn
and therefore, α1 := β0 , . . . , αn := β0 are the required coefficients in F. ■
Proposition 2.33 Let (X , F) be a vector space with basis {v 1 , . . . , v n } and let x ∈ X . Then, there exist unique coefficients
α1 , . . . , αn such that
x = α1 v 1 + α2 v 2 + . . . + αn v n .
α1 − β1 = 0, . . . , αn − βn = 0.
Proposition 2.34 Let(X , F) be an n-dimensional vector space and let {v 1 , · · · , v k } be a linearly independent set with k strictly less
than n. Then, ∃ v k+1 ∈ X such that {v 1 , · · · , v k , v k+1 } is linearly independent.
Proof: We use proof by contradiction. Suppose that {v 1 , · · · , v k } is a linearly independent and k < n, but no such v k+1 ex-
ists. Then, ∀x ∈ X , x ∈ span{v 1 , · · · , v k }, and therefore, X ⊂ span{v 1 , . . . , v k }. This in turn implies that n = dim(X ) ≤
dim(span{v 1 , . . . , v k }) = k, which contradicts k < n. Hence, there must exist v k+1 ∈ X such that {v 1 , . . . , v k , v k+1 } is linearly
independent. ■
Corollary 2.35 In a finite dimensional vector space, any linearly independent set can be completed to a basis. More precisely, let
{v 1 , · · · , v k } be linearly independent, n = dim(X ) and k < n. Then, ∃v k+1 , . . . , v n such that {v 1 , · · · , v k , v k+1 , . . . , v n } is a
basis for X .
29
Remark 2.37 Just to be absolutely 100% clear, we note
α1
α2
[x]v := . ⇐⇒ x = α1 v 1 + α2 v 2 + · · · + αn v n .
..
αn
Once a basis is specified, you can work with vectors as if they were n-tuples. This can be handy when doing numerical computations.
Solution:
Basis 1: This one can be done by inspection because the basis is so simple:
5
5 3 3
x= = 5v 1 + 3v 2 + 1v 3 + 4v 4 1 ∈R .
⇐⇒ [x]w = 4
1 4
4
0 0 1 1 α4 4
[αx]v = α[x]v
30
v
3. Once a basis v := {v 1 , . . . , v n } is chosen , an n-dimensional vector space (X , F) ←→ (F n , F).
Question Let {u1 , · · · , un } and {u1 , · · · , un } be two bases for (X , F). Is there a relation between [x]u and [x]u ? There is, via the
change of basis matrix.
Theorem
2.40 There exists an invertible matrix P , with coefficients in F, such that ∀x ∈ (X , F), [x]u = P [x]u , where, P =
P1 P2 · · · Pn and its ith column is given by Pi := [ui ]u ∈ F n , and [ui ]u is the representation of ui with respect to u.
Similarly, there exists an invertible matrix P = P 1 P 2 · · · P n with P i =[ui ]u , the representation of ui with respect to u,
and P · P = P · P = I.
Proof: We can express x ∈ X in terms of both bases, x = α1 u1 + · · · + αn un = α1 u1 + · · · + αn un , so that
α1 α1
α2 α2
α := . := [x]u and α := . = [x]u
.. ..
αn αn
From the linearity of the representation operation,
n
hX i n
X n
X
α := [x]u = αi ui = αi [ui ]u = αi Pi = P α. (2.1)
u
i=1 i=1 i=1
yielding α = P α. Combining (2.1) and (2.2) gives α = P P α and α = P P α. Because this holds for all x, and hence for all α =
and α, we deduce P P = P P = I.
Example 2.41 For F = R and X = R2×2 , find the change of basis matrices.
1 0 0 1 0 0 0 0
u= , , ,
0 0 0 0 1 0 0 1
1 0 0 1 0 1 0 0
u= , , ,
0 0 1 0 −1 0 0 1
Solution: We have following relations:
α = P α, Pi = [ui ]u , α = P α, P i = [ui ]u
−1
P = P, P −1 = P
Typically, one computes the easier of P and P , and then computes the other by matrix inversion. For this example, we choose to
compute P because the required representations can be computed by inspection.
1 0
0 1
P 1 = [u1 ]u =
0 P 2 = [u2 ]u =
1
0 0
0 0
1 0
P 3 = [u3 ]u =
−1 P 4 = [u4 ]u =
0
0 1
31
1 0 0 0 1 0 0 0
0 1 1 0
and P = (P )−1 = 0 .5 .5 0
Therefore, P =
0 1 −1 0 0 .5 −.5 0
0 0 0 1 0 0 0 1
1
0 1 0 1 0 0 1 0 1 0 0
P1 = [u1 ]u =
↔ =1· +0· +0· +0·
0 0 0 0 0 1 0 −1 0 0 1
0
0
.5 0 1 1 0 0 1 0 1 0 0
P2 = [u2 ]u =
↔ =0· + 0.5 + .5 +0·
.5 0 0 0 0 1 0 −1 0 0 1
0
0
.5 0 0 1 0 0 1 0 1 0 0
P3 = [u3 ]u =
↔ =0· + 0.5 − .5 +0·
−.5 1 0 0 0 1 0 −1 0 0 1
0
0
0
P4 = [u4 ]u = ↔ 0 0 =0· 1 0 +0· 0 1 +0· 0 1
+ 1 ·
0 0
0 0 1 0 0 1 0 −1 0 0 1
1
1 0 0 0 1 0 0 0
0 .5 .5 0
and P = P −1 = 0 1 1 0
Therefore, P =
.
0 .5 −.5 0 0 1 −1 0
0 0 0 1 0 0 0 1
■
Definition 2.44 Let (X , F) and (Y, F) be finite dimensional vector spaces, and L : X → Y be a linear operator. A matrix
1 m 1 n
representation of L with respect to a basis
u := {u , . . . , u } for X and v := {v , . . . , v } for Y is an n × m matrix A, with
coefficients in F, such that ∀x ∈ X , L(x) v = A x u .
32
1 m
Theorem 2.45 Let (X , F) and (Y, F) be finite dimensional vector spaces, L : X → Y a linear operator,
u := {u ,th. . . , u } a basis
1 n
for X and v := {v , . . . , v } a basis for Y, then L has a matrix representation A = A1 · · · Am , where the i column of A is
given by
Ai := L(ui ) v , 1 ≤ i ≤ m.
Using linearity
L(x) = L(α1 u1 + · · · + αm um )
= α1 L(u1 ) + · · · + αm L(um ).
Hence, computing representations, we have
[L(x)]v = [α1 L(u1 ) + · · · + αm L(um )]v
= α1 [L(u1 )]v + · · · + αm [L(um )]v
= α1 A1 + · · · + αm Am
α1
α2
= A1 A2 ··· Am .
..
αm
=A x u.
Hence, L(x) v = A x u . ■
Example 2.46 F = R, X = P3 (t) = {polynomials of degree ≤ 3}, and Y = P3 (t). Use the same basis on X and Y, namely
d
u := v := {1, t, t2 , t3 } and define L : X → Y by L(p) := dt p. Find A, the matrix representation of L, which will be a 4 × 4 real
matrix.
Solution: We compute A column by column, where A = A1 A2 A3 A4 . Then,
0 1
0 0
A1 = L(1) {1,t,t2 ,t3 } =
0 A2 = L(t) {1,t,t2 ,t3 } =
0
0 0
0 0
2 0
A3 = L(t2 ) {1,t,t2 ,t3 } A4 = L(t3 ) {1,t,t2 ,t3 }
=
0
=
3
0 0
and thus
0 1 0 0
0 0 2 0
A=
0
.
0 0 3
0 0 0 0
33
To check that it makes sense, we let
p(t) = a0 + a1 t + a2 t2 + a3 t3
and note that
a0
a1
p(t) {1,t,t2 ,t3 } =
a2
a3
0 1 0 0 a0 a1
0 0 2 0 a1 2a2
0 0 0 3 a2 = 3a3
A[p(t)]{1,t,t2 ,t3 } =
0 0 0 0 a3 0
Does this correspond to differentiating the polynomial p(t)? From Calculus, we have that
d
p(t) = a1 + 2a2 t + 3a3 t2
dt
a1
d 2a2
[ p(t)]{1,t,t2 ,t3 } =
3a3
dt
0
and thus, yes indeed,
d
A[p(t)]{1,t,t2 ,t3 } = [ p(t)]{1,t,t2 ,t3 } .
dt
■
1 n 1 n
Example 2.47 Let (X , F) be a finite dimensional vector space, with bases u := {u , . . . , u } and v := {v , . . . , v }. Let L : X →
X be the identity operator, that is, ∀ x ∈ X , L(x) := Id(x) := x. Then the matrix representation of L is the change of basis matrix.
Solution: For x ∈ X , we write x = α1 u1 + · · · + αn un so that its representation is
α1
α2
x u = . ∈ F m.
..
αn
As in Theorem 2.45, we define
Ai = L(ui ) v , 1 ≤ i ≤ n.
Using linearity
[L(x)]v = α1 [L(u1 )]v + · · · + αn [L(um )]v
= α1 A1 + · · · + αn An
α1
α2
= A1 A2 ··· An .
..
αn
=A x u.
Hence, ∀ x ∈ X ,
L(x) v = A x u
⇕
Id(x) v = A x u
⇕
x v=A x u
⇕
A = P the change of basis matrix.
34
■
Remark 2.48 A shorter proof is just to note that the i-th column of the change of basis matrix satisfies Pi := [ui ]v , and that the
i-th column of the matrix representation of L satisfies Ai := [L(ui )]v = [ui ]v when L is the identity operator. The relationship is
summarized in Fig. 2.1.
(a) (b)
Figure 2.1: Diagram chasing. Commuting diagrams used to illustrate the matrix representation of a linear operator and the change of
basis matrix. When X = Y and L = Id, they become one and the same. Hence, there is really only one idea to remember.
2 2 2 1
Example 2.49 Let (X , F) = R , R , and define L : R → R by L (e1 ) = 3e1 + 4e2 , L (e2 ) = −e1 + 6e2 , where e1 = and
0
0
e2 = are the canonical basis elements.
1
Solution:
(a) Let A =matrix representation of L. Then the ith column of A = [L (ei )]{e1 ,e2 } . Hence,
3
[L (e1 )]{e1 ,e2 } =
4
−1
[L (e2 )]{e1 ,e2 } =
6
3 −1
=⇒ A = .
4 6
(b) Let P be the change of coordinates from {e1 , e2 } to v 1 , v 2 , and P be the change of coordinates from v 1 , v 2 to {e1 , e2 }.
Note that the ith column of P is just the representation of v i in {e1 , e2 }. That is,
1 3
P = .
1 −4
Recall that P = P −1 , so
−1 −1 −4 −3 1 4 3
P = P = = .
7 −1 1 7 1 −1
Therefore, if A is the representation of L in v 1 , v 2 , then
1 4 3 3 −1 1 3 1 −38 16
A = P AP −1 = = .
7 1 −1 4 6 1 −4 7 −8 25
35
Note: P was readily available, not P , as you may have guessed!! Just to check, let’s do the same thing the “long way”:
L v 1 = L (e1 + e2 )
= L (e1 ) + L (e2 )
= (3e1 + 4e2 ) + (−e1 + 6e2 )
= 2e1 + 10e2
2
L v = L (3e1 − 4e2 )
= 3L (e1 ) − 4L (e2 )
= 3 (3e1 + 4e2 ) − 4 (−e1 + 6e2 )
13e1 − 12e2
L v1 {v 1 ,v 2 }
=? To find it, write
2 1 3 1 3 a11
= a11 +a21 =
10 1 −4 1 −4 a12
|{z} | {z }
v1 v2
a11 1 38
=⇒ =
a12 7 −18
Similarly
13 1 3
= a12 + a22
−12 1 −4
a12 1 16
=⇒ =
a22 7 25
and hence
1 38 16
A= .
7 −8 25
Eigenvectors are not unique because if Av = λv, then for all α ̸= 0, A(αv) = λ(αv), and thus αv is also an e-vector. To find
eigenvalues, we need to know conditions under which ∃v ̸= 0 such that Av = λv.
0 1
Example 2.51 Find the e-values and e-vectors for A = .
−1 0
0 1
Solution: A = =⇒ det(λI − A) = λ2 + 1 = 0. Therefore, the eigenvalues are λ1 = j, λ2 = −j. To find eigenvectors,
−1 0
we need to solve (A − λi I)v i = 0. The eigenvectors are
1 1 2 1
v = ,v = .
j −j
Note that both eigenvalues and eigenvectors are complex conjugate pairs. ■
36
Definition 2.52 ∆(λ) := det(λI − A) is called the characteristic polynomial. ∆(λ) = 0 is called the characteristic equation. By
the Fundamental Theorem of Algebra, ∆(λ) can be factored as
where λ1 , · · · , λp are the distinct eigenvalues (roots), mi is the algebraic multiplicity of λi , and m1 + m2 + · · · + mp = n. The
geometric multiplicity of λi is defined as ηi := dim null(A − λi I).
Theorem 2.53 Let A be an n × n matrix with coefficients in R or C. If the e-values {λ1 , . . . , λn } are distinct, that is, λi ̸= λj for
all 1 ≤ i ̸= j ≤ n, then the e-vectors {v 1 , . . . , v n } are linearly independent in (Cn ,C).
Remark 2.54 Restatement of the theorem: If the e-values {λ1 , . . . , λn } are distinct then {v 1 , . . . , v n } is a basis for (Cn ,C).
Proof: We prove the contrapositive: if {v 1 , . . . , v n } is linearly dependent then there is a repeated e-value (λi = λj for some i ̸= j).
α1 v 1 + · · · + αn v n = 0. (2.3)
Without loss of generality, we can suppose α1 ̸= 0. (that is, we can always reorder of e-values and e-vectors so that the first coeffi-
cient α1 is nonzero).
(A − λj I)v i = Av i − λj v i = λi v i − λj v i = (λi − λj )v i .
for i = 1
for i = 2
..
.
for i = n
0 = (A − λ2 I)(A − λ3 I) · · · (A − λn I)(α1 v 1 + · · · + αn v n )
= α1 (λ1 − λ2 )(λ1 − λ3 ) · · · (λ1 − λn )v 1
and hence, at least one of the terms (λ1 − λk ), 2 ≤ k ≤ n must be zero. Therefore, there is a repeated e-value λ1 = λk for some
2 ≤ k ≤ n. ■
37
2.8 A Few Additional Properties of Matrices
Definition 2.55 Two n × n matrices A and B are similar if there exists an invertible n × n matrix P such that B = P · A · P −1 . P
is called a similarity matrix.
Definition 2.56 An n × n matrix A is said to have a full set of e-vectors if there exists a basis {v 1 , v 2 , . . . , v n } of (Cn , C) such
Av i = λi v i , 1 ≤ i ≤ n.
Theorem 2.57 An n × n matrix A has a full set of e-vectors if, and only if, it is similar to a diagonal matrix. And when this happens,
the entries on the diagonal matrix are e-values of A.
Proof: We assume that {v 1 , . . . , v n } is a basis for (Cn , C) and that for 1 ≤ i ≤ n, Av i = λi v i . Define two n × n matrices
M := v 1 v 2 · · · v n
Λ := diag(λ1 , λ2 , . . . , λn ).
Then
A · M := Av 1 Av 2 Av n
···
= λ1 v 1 λ2 v 2 λn v n
···
= M · Λ.
and hence M is invertible if, and only if, {v 1 , . . . , v n } is linearly independent. Therefore we have
A = M ΛM −1 and Λ = M −1 AM,
The other direction is straightforward and left to the reader. You need to recognize the columns of the “similarity matrix” as being
e-vectors of A. ■
Fact 2.58 If A and B are similar, they have the same e-values. Moreover, the e-values have the same algebraic and geometric
multiplicities.
Definition 2.59 Let A be an n × m matrix with coefficients in R or C. The rank of A is equal to the number of linearly independent
columns.
Fact 2.60 If M is square, then rank(M ) equals the number of non-zero e-values.
Fact 2.61 For an n × m real matrix A, rank(A) = rank(A⊤ A) = rank(AA⊤ ) = rank(A⊤ ). Hence, A⊤ A and AA⊤ have the
same number of non-zero e-values. For a proof, see Chapter 10 of the ROB 101 textbook and see Lemma 2.63 below.
Corollary 2.62
• # of linearly independent rows = # of linearly independent columns.
• rank(A) ≤ min(n, m)
• If λ is a non-zero e-value of AA⊤ with e-vector v, then λ is also an e-value of A⊤ A with e-vector A⊤ v.
38
Proof: We only prove the first statement. Suppose that A⊤ A v = λv and both λ and v are non-zero. Then Av ̸= 0. Next, we form
Corollary 2.64 AA⊤ and A⊤ A have the same non-zero e-values. Because they have different sizes, they may have different number
of zero e-values.
Pn
Definition 2.65 (Trace of a Matrix) Let C be an n × n matrix. Then tr(C) := i−1 Cii .
Fact 2.67 (A lesser known way doing matrix multiplication, the outer product formula.) Suppose that A is n × k and B is k × m
so that the two matrices are compatible for matrix multiplication. Then
k
X
A·B = acol row
i · bi ,
i=1
the “sum of the columns of A multiplied by the rows of B”. A more precise way to say it would be “the sum over i of the i-th column
of A times the i-th row of B.”
Why: To see why this is true, let’s first consider two 2 × 2 matrices A and B, where
a11 a12 b11 b12
A= and B = .
a21 a22 b21 b22
Then, using the standard rows of A times the columns of B formulation of matrix multiplication yields
39
In a similar manner, we can treat the general case,
Fact 2.68 (Matrix Inversion Lemma) Suppose that A, B, C and D are compatible2 matrices. If A, C, and (C −1 + DA−1 B) are
each square and invertible, then A + BCD is invertible and
Remark 2.69 In many important applications, the inverse of A may be already known or easy to compute. Here is a made up
example, but it gets the point across: By hand, evaluate (A + BCD)−1 when
1
0
⊤
2 , C = 0.2, D = B
A = diag([1, 0.5, 0.5, 1, 0.5]), B =
0
3
2 The sizes are such the matrix products and sum in A + BCD make sense.
40
Chapter 3
Learning Objectives
• Learn that a norm is a general means of measuring the length of a vector.
• Learn how the notions of an inner product and an inner product space generalize the dot product on Rn to a wide range of
useful settings.
Outcomes
• Norms and normed spaces as settings where best approximation problems can be posed.
• Learn that orthogonality, the Pythagorean Theorem, and Gram-Schmidt work just as well in abstract inner product spaces as
they do in Rn .
• The “normal equations” provide a systematic means to compute solutions to best approximation problems in any finite-
dimensional inner product space.
• While we are on the subject of “orthogonality”, we’ll dive into real symmetric matrices and see that their eigenvectors have the
amazing property that they can be selected to form an orthonormal basis for Rn .
• We solve two very important least squares problems for overdetermined systems, underdetermined systems, and provide a
recursive solution for the case of overdetermined systems. The latter is meant as a preview of the Kalman Filter (KF).
41
3.1 Preliminaries on Norms and Normed Spaces
Definition 3.1 Let (X , F) be a vector space where the field F is either R or C. A function ∥ · ∥: X → R is a norm if it satisfies
Example 3.2
(a) F := R or C, X := F n .
n 12
2
P
• ∥x∥2 := |xi | , Euclidean norm or 2-norm
i=1
n
p1
p
P
• ∥x∥p := |xi | , 1 ≤ p < ∞, p-norm
i=1
• ∥x∥∞ := max |xi |, max-norm, sup-norm, ∞-norm
1≤i≤n
Notation and Vocabulary 3.4 (Notions of Distance and Best Approximation) For x, y ∈ X ,
• d(x, y) := ∥x − y∥ is called the distance from x to y. We note that d(x, y) = d(y, x).
• If ∃x∗ ∈ S such that d(x, S) = ∥x − x∗ ∥, then x∗ is called a best approximation of x by elements of S. We sometimes write
b for x∗ because we are really thinking of the solution as an approximation.
x
Notation 3.6 (arg min) When x∗ exists and is unique, we write x∗ := arg min∥x − y∥ or x
b := arg min∥x − y∥. It means that x∗ is
y∈S y∈S
the argument of the minimum function that achieves the minimum value.
Remark 3.7
(c) When ∃e
y ∈ S such that ||x − ye|| = inf ∥x − y∥, we can write ||x − ye|| = min∥x − y∥. While it may seem natural to denote
y∈S y∈S
the best y by y ∗ , we typically denote it by x∗ because we are trying to best approximate x by elements of S.
42
(d) x∗ := arg min∥x − y∥ is the value of y ∈ S that best achieves the approximation of x in the sense of minimizing the norm.
y∈S
∗
(e) x := arg min∥x − y∥ only makes sense when inf ∥x − y∥ = min∥x − y∥, that is, the infimum is actually achieved.
y∈S y∈S y∈S
(f) Furthermore, if you really want to be careful, you should also check that there is a unique minimum. Otherwise, the correct
notation is x∗ ∈ arg min∥x − y∥. In engineering publications, you rarely see this much care being taken. C’ést la vie, baby !
y∈S
Figure 3.1: In HW, we’ll introduce the concept of “strict” norms. The || • ||1 is not strict and consequently, the minimum distance
problem (or best approximation problem) does not have a unique answer.
Remark 3.8 Figure 3.1 shows an example having nonunique solutions. Indeed, the set
∀x
b ∈ S, ||x − x
b||1 = 2 = inf ||x − m||1 .
m∈M
||x − x
b||1 = |1 − x̂1 | + |1 + x̂1 |, −1 ≤ x̂1 ≤ 1
= (1 − x̂1 ) + (1 + x̂1 )
=2
43
Exercise 3.9 Consider (Rn , R) with the p-norm. Show that for all x ∈ Rn ,
lim ||x||p = ||x||max .
p→∞
Hints: (i)√Prove the result when ||x||max = 1. (ii) When x ̸= 0, consider x̄ = x/||x||max . (iii) For any non-negative real number a,
p
limp→∞ a = 1.
4. If the field is the real numbers, then the above reduces to ⟨x, β1 y1 + β2 y2 ⟩ = β1 ⟨x, y1 ⟩ + β2 ⟨x, y2 ⟩
Example 3.13 Common inner products:
n
(a) (Cn , C), ⟨x, y⟩ := x⊤ y =
P
xi yi .
i=1
n
(b) (Rn , R), ⟨x, y⟩ := x⊤ y =
P
xi yi .
i=1
44
We’ll now make a choice for λ that minimizes ⟨x, x⟩ − 2λ⟨x, y⟩ + λ2 ⟨y, y⟩. Taking the derivative with respect to λ and setting it
equal to zero yields λ = ⟨x, y⟩/⟨y, y⟩. In case you are curious whether this is a max or min, note that ⟨y, y⟩ > 0 because y ̸= 0.
Therefore, we can conclude that |⟨x, y⟩|2 ≤ ⟨x, x⟩⟨y, y⟩ =⇒ |⟨x, y⟩| ≤ ⟨x, x⟩1/2 ⟨y, y⟩1/2 and the proof is done.
We quickly outline the steps for F = C. Because the inner product of a vector with itself is always a non-negative real number, for
all scalars λ ∈ C,
0 ≤ ⟨x − λy, x − λy⟩ = ⟨x, x⟩ − λ⟨y, x⟩ − λ⟨x, y⟩ + |λ|2 ⟨y, y⟩.
⟨x,y⟩
For the particular choice λ = ⟨y,y⟩ , direct calculation shows
|⟨x, y⟩|2
0 ≤ ⟨x, x⟩ − ,
⟨y, y⟩
which gives
p
|⟨x, y⟩| ≤ ⟨x, x⟩⟨y, y⟩ = ⟨x, x⟩1/2 · ⟨y, y⟩1/2 ,
and the proof is done.
Corollary 3.15 Let (X , F, ⟨· , ·⟩) be an inner product space, with F either R or C. Then,
p
∥x∥ := ⟨x, x⟩1/2 = ⟨x, x⟩
is a norm.
Proof: As before, for clarity of exposition, we first assume F = R. We will only check the triangle inequality ∥x + y∥ ≤ ∥x∥ + ∥y∥,
which is equivalent to showing ∥x + y∥2 ≤ ∥x∥2 + ∥y∥2 + 2∥x∥ · ∥y∥. The other parts are left as an exercise.
∥x + y∥2 := ⟨x + y, x + y⟩
= ⟨x, x + y⟩ + ⟨y, x + y⟩
= ⟨x, x⟩ + ⟨x, y⟩ + ⟨y, x⟩ + ⟨y, y⟩
= ∥x∥2 + ∥y∥2 + 2⟨x, y⟩
≤ ∥x∥2 + ∥y∥2 + 2 |⟨x, y⟩|
≤ ∥x∥2 + ∥y∥2 + 2∥x∥ · ∥y∥
45
We’ll now quickly do the changes required to handle F = C. The triangle inequality is ∥x + y∥ ⩽ ∥x∥ + ∥y∥, which is equivalent
to showing ∥x + y∥2 ⩽ ∥x∥2 + 2∥x∥ ∥y∥ + ∥y∥2 . Brute force computation yields,
∥x + y∥2 = ⟨x + y, x + y⟩
= ⟨x, x + y⟩ + ⟨y, x + y⟩
= ⟨x + y, x⟩ + ⟨x + y, y⟩
= ⟨x, x⟩ + ⟨y, x⟩ + ⟨x, y⟩ + ⟨y, y⟩
= ⟨x, x⟩ + ⟨x, y⟩ + ⟨x, y⟩ + ⟨y, y⟩
= ∥x∥2 + ∥y∥2 + 2Re{⟨x, y⟩}
where Re{⟨x, y⟩} denotes the real part of the complex number ⟨x, y⟩. However, for any complex number α, Re{α} ≤ |α|, and thus
we have
Define
k−1
X ⟨y k , v j ⟩ j
vk = yk − ·v (3.2)
j=1
∥v j ∥2
46
Proof: We first note that from (3.1), v i ̸= 0 for 1 ≤ i ≤ k − 1. Next, the orthogonality of {v 1 , · · · , v k } is essentially by construction.
To see this, we write
k−1
X
vk = yk − aki v i
i=1
⟨y k , v j ⟩
akj = .
∥v j ∥2
Indeed, (
i j 0 j ̸= i, 1 ≤ i, j ≤ k − 1
⟨v , v ⟩ =
||v i ||2 i = j,
and hence
k−1
X
0 = ⟨v k , v j ⟩ = ⟨y k , v j ⟩ − aki ⟨v i , v j ⟩ = ⟨y k , v j ⟩ − akj ⟨v j , v j ⟩ = ⟨y k , v j ⟩ − akj ||v j ||2 .
i=1
From (3.2),
k−1
X ⟨y k , v j ⟩ j
yk = vk + · v =⇒ y k ∈ span{v 1 , · · · , v k }.
j=1
∥v j ∥2
so
k−1
X ⟨v j , y k ⟩
v j ∈ span{y 1 , · · · , y k−1 } ⊂ span{y 1 , · · · , y k }.
j=1
∥v j ∥2
k 1 k
Clearly, y ∈ span{y , · · · , y }. Putting these two facts together,
k−1
X ⟨v j , y k ⟩
vk = yk − v j ∈ span{y 1 , · · · , y k }
j=1
∥v j ∥2
Definition 3.20 The Gram-Schmidt Process consists of initialing (3.1) with v 1 = y 1 and then applying (3.2) recursively. When
implementing the algorithm in code, it is quite easy to normalize the vectors as you go, as in the following pseudocode:
47
Example 3.21 Given the following data in (R3 , R),
1 1 0
{y 1 , y 2 , y 3 } = 1 , 2 , 1 ,
0 3 1
P3
and inner product < p, q >:= pT q = i=1 pi qi , apply Gram-Schmidt to produce an orthogonal basis. Normalize to produce an
orthonormal basis.
Solution:
1
v1 = y1 = 1
0
∥v 1 ∥2 = (v 1 )T v 1 = 2;
< v1 , y2 > 1
v2 = y2 − v
∥v 1 ∥2
1
−2
1 1 1 1 1
= 2 − 1 1 0 2 1 = 2
2
3 3 0
| {z } 3
3
2 2 1 19
∥v ∥ = 9 = ;
2 2
√1
2
v1
ṽ1 = = √1
1
∥v ∥ 2
0
−1
√
38
v2
√1
ṽ2 = =
38
∥v 2 ∥
q
2
3 19
6
− 19
v3 19 6
ṽ3 = 3
=√ 19
∥v ∥ 76
2
− 19
48
R1
Example 3.22 Given the real vector space C[0, 1] = {f : [0, 1] → R | f continuous}, inner product < f, g >:= 0
f (τ )g(τ )dτ ,
and {y 1 , y 2 , y 3 } = {1, t, t2 }, apply Gram-Schmidt to produce an orthogonal basis for span{1, t, t2 }.
Solution:
v1 = y1 = 1
Z 1
1 2
∥v ∥ = (1)2 dτ = 1;
0
< v1 , y2 > 1
v2 = y2 − v
∥v 1 ∥2
Z 1
1 1
=t− 1 · τ dτ · · 1 = t −
1 2
|0 {z }
1
2
Z 1
1 1
∥v 2 ∥2 = (τ − )2 dτ = ;
0 2 12
1 1
= t2 − − t −
3 2
2 1
=t −t+ .
6
■
1 >> clear *
2 >> syms t % declare to be a symbolic variable
3
16 >> v1=y1
17 >> v2=y2-int(v1*y2,0,1)*v1/int(v1^2,0,1)
18 >> v3=y3-int(v1*y3,0,1)*v1/int(v1^2,0,1)-
19 int(v2*y3,0,1)*v2/int(v2^2,0,1)
20
49
23 >> v1_tilde=v1/int(v1^2,0,1)^.5
24 >> v2_tilde=simplify( v2/int(v2^2,0,1)^.5 )
25 >> v3_tilde=simplyfy(v3/int(v3^2,0,1)^.5)
Output
v1=1
v2=t-1/2
v3=t^2+1/6-t
v1_tilde=1
v2_tilde=(t-1/2)*12^(1/2)
v3_tilde=(6*t^2+1-6*t)*5^(1/2)
Remark 3.24 (Round-off Errors Affect Classical Gram-Schmidt) The classical Gram-Schmidt Process in Definition 3.20 is straight-
forward to understand, which is why it is taught in courses. Unfortunately, it behaves poorly under the round-off error that occurs in
digital computations! Here is a standard example:
1 1 1
ε 0 0
u1 = 0 , u2 = ε , u3 = 0 , ε > 0
0 0 ε
Let {e1 , e2 , e3 , e4 } be the standard basis vectors corresponding to the columns of the 4 × 4 identity matrix. We note that
u2 = u1 + ε(e3 − e2 )
u3 = u2 + ε(e4 − e3 )
Example 3.25 Hence, Gram-Schmidt applied to {u1 , u2 , u3 } and {u1 , (e3 − e2 ), (e4 − e3 )} should “theoretically” produce the
same orthonormal vectors. To check this, we go to MATLAB, and for ε = 0.1, we do indeed get the same results. You can verify this
yourself. However, with ε = 10−8 ,
1.0000 0.0000 0.0000
0.0000 −0.7071 −0.7071
Q1 = 0.0000
0.7071 0.0000
0.0000 0.0000 0.7071
1.0000 0.0000 0.0000
0.0000 −0.7071 −0.4082
Q2 =
0.0000 0.7071 −0.4082
0.0000 0.0000 0.8165
where h i
v1 v2 v3
Q1 = ∥v1 ∥ ∥v2 ∥ ∥v3 ∥
has been computed with Classical-Gram-Schmidt for {u1 , (e3 − e2 ), (e4 − e3 )}. Hence we do NOT obtain the same result!
50
end
for k = 1 : n
vk = ∥vvkk ∥
for j = (k + 1) : n
vj = vj − (vj • vk )vk %Makes vj orthogonal to vk
end
end
At Step k=1, v1 is normalized to length one, and then v2 , . . . , vn are redefined to be orthogonal to v1 . At Step k=2: v2 is normalized
to length one, and then v3 , . . . , vn are redefined to be orthogonal to v2 . We note that they were already orthogonal to v1 . At Step k: vk
is normalized to length one, and then vk+1 , . . . , vn are redefined to be orthogonal to vk . We note that they were already orthogonal
to v1 , . . . , vk−1 .
Example 3.27 Hence, if Modified Gram-Schmidt is so great, when applied to {u1 , u2 , u3 } and {u1 , (e3 − e2 ), (e4 − e3 )}, it should
produce the same orthonormal vectors and it does! To check this, we go to MATLAB for ε = 10−8 and obtain
1.0000 0.0000 0.0000
0.0000 −0.7071 −0.7071
Q1 = 0.0000
0.7071 0.0000
0.0000 0.0000 0.7071
1.0000 0.0000 0.0000
0.0000 −0.7071 −0.7071
Q2 = 0.0000
0.7071 0.0000
0.0000 0.0000 0.7071
where Q1 and Q2 are defined above. When one is equipped with the right Algorithm, the world is truly a marvelous place.
Remark 3.28 Just to be perfectly clear, with perfect arithmetic (no rounding errors), Classical Gram-Schmidt and Modified Gram-
Schmidt are equivalent.
Remarks:
(a’) If ∃m0 ∈ M such that ∥x − m0 ∥ = d(x, M ) = inf ∥x − m∥, then m0 is unique. [equivalent to (a)]
m∈M
Proof: (By contrapositive) Assume x − m0 ̸⊥ M . We will produce m1 ∈ M such that ∥x − m1 ∥ < ∥x − m0 ∥. Indeed, suppose
x − m0 ̸⊥ M . Then, ∃m ∈ M such that ⟨x − m0 , m⟩ =
̸ 0. We know m ̸= 0, and hence we define
m
• m̃ = ∥m∥ ∈ M;
• δ := ⟨x − m0 , m̃⟩ =
̸ 0; and
• m1 = m0 + δ m̃ =⇒ m1 ∈ M .
51
The intuition behind the definition of m1 is that x − m1 is “closer” to being perpendicular to M than is x − m0 , and hence it should
follow that ||x − m1 || < ||x − m0 ||. To prove the latter point, we do a few computations:
∥x − m1 ∥2 = ∥x − m0 − δ m̃∥2
= ⟨x − m0 − δ m̃, x − m0 − δ m̃⟩
= ⟨x − m0 , x − m0 ⟩ − δ ⟨x − m0 , m̃⟩ −δ ⟨m̃, x − m0 ⟩ +δ 2 ⟨m̃, m̃⟩
| {z } | {z } | {z }
δ δ =1
= ∥x − m0 ∥2 − δ 2
< ∥x − m0 ∥2
∥x − m∥2 = ∥x − m0 + m0 − m ∥2
| {z }
∈M
= ∥x − m0 ∥ + ∥m0 − m∥2 .
2
It follows that
Definition 3.32 Let (X , F, ⟨·, ·⟩) be an inner product space, and S ⊂ X a subset (does not have to be a subspace).
Exercise 3.33
• Show that S ⊥ is always a subspace.
• If M = span{y 1 , . . . , y k }, show that (x ∈ M ⊥ ) ⇐⇒ ⟨x, y i ⟩ = 0, 1 ≤ i ≤ k.
Proposition 3.34 Let (X , F, ⟨·, ·⟩) be a finite dimensional inner product space and M a subspace of X . Then,
X = M ⊕ M ⊥.
Remark 3.35 Suppose that V and W are subspaces of X . Then V +W := {x ∈ X | x = v +w, for some v ∈ V, w ∈ W }. Because
V and W are subspaces, 0 ∈ V ∩ W (the zero vector is in their intersection). If that is the only vector in the intersection, meaning
V ∩ W = {0}, the zero subspace, then we write V ⊕ W , and it is called the direct sum of V and W . What does the direct sum get
you that an ordinary sum would not? You can show that (x ∈ V ⊕ W ) ⇐⇒ (∃ unique v ∈ V, w ∈ W such that x = v + w).
Proof: If x ∈ M ∩ M ⊥ , then by the definition of M ⊥ , ⟨x, x⟩ = 0, which implies x = 0. Hence, M ∩ M ⊥ = {0}. Next, we need to
show that X = M + M ⊥ , that is, every X ∈ X can be written as a sum of a vector in M and a vector in M ⊥ .
Let {y 1 , . . . , y k } be a basis of M . By Corollary 2.35, it can be completed to a basis for X , that is,
52
We can then apply Gram-Schmidt to produce orthonormal vectors {v 1 , . . . , v k , v k+1 , . . . , v n } such that
⟨x, v i ⟩ = α1 ⟨v 1 , v i ⟩ + · · · + αi ⟨v i , v i ⟩ + · · · + αn ⟨v n , v i ⟩
= αi (because ⟨v j , v i ⟩ = 0, j ̸= i, and ⟨v i , v i ⟩ = 1)
Theorem 3.36 (Classical Projection Theorem) Let (X , R) be a finite dimensional real inner product space and M a subspace of X .
Then, ∀x ∈ X , ∃ unique x
b ∈ M such that
∥x − x
b∥ = d(x, M ) := inf ∥x − m∥ = min ∥x − m∥,
m∈M m∈M
b ∈ M is characterized by x − x
where we can write min instead of inf because the infimum is achieved. Moreover, x b ⊥ M.
b + m⊥ .
x=x
Hence,
b = m⊥ ∈ M ⊥ =⇒ (x − x
x−x b) ⊥ M.
■
Remark 3.37 You may have observed that X = M ⊕ M ⊥ also shows that x b is unique. While this is true, it is based on Proposi-
tion 3.34, which is true when X is a “complete” inner product space and M is a “closed” subspace, properties that are automatically
satisfied when X is finite dimensional. We will touch on these more subtle properties later when we do some basic Real Analysis.
Our next goal is to turn Theorem 3.36 into a means to compute the best approximation value, x
b. By the Projection Theorem, x
b exists
and is characterized by x − x
b ⊥ M . We write
b = α1 y 1 + α2 y 2 + · · · + αk y k
x
and impose
(x − x b ⊥ y i , 1 ≤ i ≤ k) ⇐⇒ (⟨x − x
b ⊥ M ) ⇐⇒ (x − x b, y i ⟩ = 0, 1 ≤ i ≤ k) ⇐⇒ (⟨b
x, y i ⟩ = ⟨x, y i ⟩ 1 ≤ i ≤ k).
Then,
x, y i ⟩ = ⟨x, y i ⟩ 1 ≤ i ≤ k
⟨b
⇕
⟨α1 y + α2 y + · · · + αk y , y ⟩ = ⟨x, y i ⟩ 1 ≤ i ≤ k
1 2 k i
⇕
α1 ⟨y , y ⟩ + α2 ⟨y , y ⟩ + · · · + αk ⟨y , y ⟩ = ⟨x, y i ⟩ 1 ≤ i ≤ k
1 i 2 i k i
53
We now write these equations out in matrix form.
i = 1 α1 ⟨y 1 , y 1 ⟩ + α2 ⟨y 2 , y 1 ⟩ + · · · + αk ⟨y k , y 1 ⟩ = ⟨x, y 1 ⟩
i = 2 α1 ⟨y 1 , y 2 ⟩ + α2 ⟨y 2 , y 2 ⟩ + · · · + αk ⟨y k , y 2 ⟩ = ⟨x, y 2 ⟩
.. .. (3.4)
. .
i=k α1 ⟨y 1 , y k ⟩ + α2 ⟨y 2 , y k ⟩ + · · · + αk ⟨y k , y k ⟩ = ⟨x, y k ⟩.
Definition 3.39 Equations 3.4 are called the Normal Equations. The Gram Matrix is the k × k matrix Gij := ⟨y i , y j ⟩. The Normal
Equations can also refer to
G⊤ α = β (3.5)
where,
⟨x, y 1 ⟩
1 1
⟨y 1 , y 2 ⟩ ⟨y 1 , y k ⟩
α1 β1 ⟨y , y ⟩ ···
α2 ⟨x, y 2 ⟩ β2 ⟨y 2 , y 1 ⟩ ⟨y 2 , y 2 ⟩ ··· ⟨y 2 , y k ⟩
α := . , β := . =: . , and G := G(y 1 , · · · , y k ) := .
.. .. ..
.. .. .. . . .
αk ⟨x, y k ⟩ βk ⟨y k , y 1 ⟩ ⟨y k , y 2 ⟩ ··· ⟨y k , y k ⟩
Remark 3.40 Because we are assuming F = R, Gij = ⟨y i , y j ⟩ = ⟨y j , y i ⟩ = Gji , and we therefore have G = G⊤ . We’ll show
below that det(G) ̸= 0 if, and only if, {y 1 , y 2 , . . . , y k } is linearly independent. In this case,
b = α1 y 1 + α2 y 2 + · · · + αk y k
x
Proposition 3.41 (Invertibility of the Gram Matrix) Let g(y 1 , . . . , y k ) := det G(y 1 , . . . , y k ) be the determinant of the Gram Matrix.
Then g(y 1 , . . . , y k ) ̸= 0 ⇐⇒ {y 1 , y 2 , . . . , y k } is linearly independent.
Proof: From our construction of the normal equations, G⊤ α = 0 if, and only if
⟨α1 y 1 + α2 y 2 + · · · + αk y k , y i ⟩ = 0, 1 ≤ i ≤ k.
This is equivalent to
α1 y 1 + α2 y 2 + · · · + αk y k ⊥ y i = 0, 1 ≤ i ≤ k,
which is equivalent to
α1 y 1 + α2 y 2 + · · · + αk y k ⊥ span{y 1 , · · · , y k } =: M
and thus
α1 y 1 + α2 y 2 + · · · + αk y k ∈ M ⊥ .
Because α1 y 1 + α2 y 2 + · · · + αk y k ∈ M , we have that
α1 y 1 + α2 y 2 + · · · + αk y k ∈ M ∩ M ⊥ = {0}
Summary 3.42 (The normal equations provide a systematic solution of our best approximation problem). Assume the set
{y 1 , . . . , y k } is linearly independent and M := span{y 1 , · · · , y k }. Then x
b = arg min d(x, M ) = arg min ∥x − m∥ if, and only if,
m∈M
b = α1 y 1 + α2 y 2 + · · · + αk y k
x
G⊤ α = β
(3.6)
Gij = ⟨y i , y j ⟩
βi = ⟨x, y i ⟩.
What changes in each application of the normal equations is typically the inner product, ⟨•, •⟩.
54
Overdetermined Equations
Roughly speaking, a set of linear equations Ax = b is overdetermined when A has more rows than equations. Typically, in
this case, there is no value of x such that Ax − b = 0. Of course, if b = 0 then x = 0 is a solution, and more generally, there
is a solution if, and only if, b ∈ col span{A}, that is, b can be written as a linear combination of the columns of A. When
b ̸∈ col span{A}, it makes sense to seek a “best approximate solution”, which we’ll define to be a value xb that minimizes the
norm of the error in the solution, that is,
b := arg min ∥Ax − b∥.
x
x
Is there a difference between being overdetermined and having no exact solutions? Yes. It’s possible to be overdeter-
mined and still have an exact solution when b ∈ col span{A}. If the columns of A are linearly independent, then
Example 3.43 (Overdetermined system of linear equations in Rn ) Consider Aα = b, where A = n × m real matrix, n ≥ m,
rank(A) = m (columns of A are linearly independent). From the dimension of A, we have that α ∈ Rm , b ∈ Rn . Determine if there
exists a “best approximate solution” using the Euclidean norm.
b ∈ Rm such that
Solution 1: The Standard Problem Formulation and Solution goes like this. We seek α
∥Ab
α − b∥ = minm ∥Aα − b∥,
α∈R
s
n
n
P
where the norm is defined on the error, e := Aα − b ∈ R , namely ∥e∥ = (ei )2 . Hence, the problem is
i=1
q
α
b = arg min (Aα − b)⊤ (Aα − b) = arg min(Aα − b)⊤ (Aα − b).
α∈Rm α∈Rm
The problem can be solved by “completing the square” or by applying vector calculus to compute the gradient, namely,
∂
(Aα − b)⊤ (Aα − b) = A⊤ · (Aα − b),
∂α
setting it to zero, and solving for α. Once you convince yourself that A⊤ A is invertible, the solution is computed to be
A⊤ · (Ab
α − b) = 0 ⇐⇒ (A⊤ A)b
α = A⊤ b ⇐⇒ α
b = (A⊤ A)−1 A⊤ b.
Solution 2: Our Problem Formulation and Solution: The standard formulation puts all the emphasis on the α and misses that Aα is
a linear combination of the columns of A. The problem is calling for a linear combination of the columns of A that best approximates
the vector b. We will therefore apply the normal equations (3.5), where the column span of A is the subspace M and b is the vector
x ∈ X.
n n
Inner product space: X = Rn , F = R, ⟨x, y⟩ = xT y = y T x = xi yi =⇒ ∥x∥2 = ⟨x, x⟩ = |xi |2 .
P P
i=1 i=1
⊤
Partition A = A1 A2 ··· Am and define α = α1 α2 ··· αm , and note that
Aα = α1 A1 + α2 A2 + · · · + αm Am .
Hence, G⊤ = A⊤ A and β = A⊤ b. By Proposition 3.41, G is invertible because the columns of A are linearly independent.
Therefore, we have that
b = (A⊤ A)−1 A⊤ b.
α
■
55
Remark 3.44 Once we note that
A⊤
1
⊤
A2
A⊤ = . when A = A1
A2 ··· Am
..
⊤
Am
The standard “row times column” definition of matrix multiplication gives that (A⊤ A)ij = A⊤ ⊤ ⊤
i Aj . Similarly, (A b)i = Ai b.
Definition 3.45 (Orthogonal Projection Operator) Let X be a finite dimensional (real) inner product space and M a subspace of
X . For x ∈ X and x
b ∈ M . The Projection Theorem shows the TFAE:
(a) x − x
b ⊥ M.
(c) ∥x − x
b∥ = d(x, M ) = inf ∥x − m∥.
m∈M
A function P : X → M by P (x) = x b satisfies any one of (a), (b), or (c), is called the Orthogonal Projection of X onto M .
b, where x
The following sets up the Projection Theorem for underdetermined linear equations.
Lemma 3.47 (Linear Varieties or Translates of Subspaces) Let (X , R, < ·, · >) be a finite-dimensional inner product space. Let
{y1 , · · · , yp } be a linearly independent set in X and let c1 , · · · , cp be real constants. Define
Claim 3.48 There exists a unique x0 ∈ span{y1 , · · · , yp } such that < x0 , yi >= ci , 1 ≤ i ≤ p.
Remark: Another way of stating the Claim is that there exists a unique x0 ∈ X such that
V ∩ span{y1 , · · · , yp } = {x0 }.
⊥
Claim 3.49 Let M = (span{y1 , · · · , yp }) . Then V = x0 + M ; in other words, x ∈ V if, and only if, (x − x0 ) ⊥
span{y1 , · · · , yp }.
Claim 3.50 There exists a unique v ∗ ∈ V having minimum norm, and v ∗ is characterized by v ∗ ⊥ M (just for emphasis, we
note that the result does not say that v ∗ ⊥ V ).
Remarks: We note that v ∗ ⊥ M ⇐⇒ v ∗ ∈ span{y1 , · · · , yp } because X = M ⊕ M ⊥ implies that
⊥
M ⊥ := span{y1 , · · · , yp }⊥ = span{y1 , · · · , yp }.
We are using the standard induced norm, ||x|| =< x, x >1/2 . There exists v ∗ having minimum norm means ||v ∗ || =
inf v∈V ||v||, and thus v ∗ = arg minv∈V ||v|| = arg minv∈V ||v||2 .
56
Theorem 3.51 Let (X , R, < ·, · >) be a finite-dimensional inner product space. Let {y1 , · · · , yp } be a linearly independent set in X
and let c1 , . . . , cp be real constants. Define V = {x ∈ X | < x, yi >= ci , 1 ≤ i ≤ p}. Then there exists a unique v ∗ ∈ V such that
Proof: From Claim 3.50 and the remark that follows it, we have v ∗ ∈ span{y1 , . . . , , yp } and thus we write v ∗ = β1 y1 + · · · + βp yp .
Then imposing that v ∗ ∈ V immediately gives (3.8). ■
Remark 3.52 Equation (3.7) is a special case of a Quadratic Program, typically called a QP for short. It has a quadratic cost,
∥v∥2 = ⟨v, v⟩, and a set of linearly independent equality constraints, namely v ∈ V ⇐⇒ ⟨v, yi ⟩ = ci , 1 ≤ i ≤ p.
Proposition 3.53 If A is n × n and real, then its e-values and e-vectors occur in complex conjugate pairs.
Proof: Let λ ∈ C and (v ∈ Cn , v ̸= 0) satisfy Av = λv. To show: Av = λv, that is, if λ is a e-value with e-vector v, then its
complex conjugate λ is also an e-value with e-vector v. The proof relies on the fact that if z1 , z2 ∈ C, then z1 · z2 = z 1 · z 2 , and its
natural extension to matrices and vectors that the reader can show.
• A · v = A · v = A · v (because A is real).
• A · v = λ · v = λ · v (because Av = λv)
Hence, as we needed to show, Av = λv. ■
Proposition 3.55 If A is real (i.e., A = A) and symmetric (i.e., A⊤ = A), then ∀ x, y ∈ Cn , ⟨Ax, y⟩ = ⟨x, Ay⟩.
Proof: Using A⊤ = A, we obtain ⟨Ax, y⟩ = x⊤ A⊤ y = x⊤ Ay. Using A = A, we obtain ⟨x, Ay⟩ = x⊤ Ay = x⊤ Ay = x⊤ Ay.
Hence, ⟨Ax, y⟩ = ⟨x, Ay⟩ = x⊤ Ay. ■
⇕
λ = λ,
because ||v|| =
̸ 0. ■
57
Remark 3.57 We now know that when A is real and symmetric, any eigenvalue λ is real, and therefore we can assume the corre-
sponding eigenvector is real. Indeed,
(A − λI) v = 0.
| {z }
real
Hence we have 0 ̸= v ∈ Rn and we can use the real inner product on Rn , namely ⟨x, y⟩ = x⊤ y.
Proposition 3.58 Let A be an n × n real symmetric matrix and let λ1 and λ2 be distinct (real) e-values. Then the corresponding
(real) e-vectors are orthogonal.
⟨Av 1 , v 2 ⟩ = ⟨v 1 , Av 2 ⟩
⇕
⟨λ1 v , v ⟩ = ⟨v 1 , λ2 v 2 ⟩
1 2
⇕
λ1 ⟨v , v ⟩ = λ2 ⟨v 1 , v 2 ⟩
1 2
⇕
0 = (λ1 − λ2 )⟨v 1 , v 2 ⟩.
Proposition 3.59 The e-vectors of an n × n real symmetric matrix can always be chosen to form an orthonormal basis of Rn .
Proof: Proposition 3.58 handles the case that the e-values of A are distinct. A HW assignment will treat the case of repeated e-values.
■
n
Remark 3.60 Real symmetric matrices are special in that one can ALWAYS find a basis of R consisting of e-vectors. Recall that
0 1
this is false for general (real) matrices as shown by the example A = .
0 0
Proposition 3.63 Suppose A is an n × n real symmetric matrix. Then there exists an orthogonal matrix Q such that
Q⊤ AQ = Λ = diag(λ1 , . . . , λn ).
Proof: By Proposition 3.59, there exists an orthonormal basis of Rn such that Av i = λi v i , 1 ≤ i ≤ n. We define
Qn =: v 1 v2 vn ,
Q := Q1 Q2 ··· ···
Exercise 3.64 Let Q be an n × n orthogonal matrix and consider (Rn , R, ∥ • ∥) with the Euclidean norm. Then ∀ x ∈ Rn , ∥Qx∥ =
∥x∥, that is, orthogonal matrices are norm preserving. (Hint: Work with ∥Qx∥2 .)
58
3.6 Quadratic Forms, Positive Definite Matrices, and Schur Complements
We are building toward estimation (or best approximation) problems where some measurements are “less certain” that others, and
hence we need unequal weights on our error terms.
Remark 3.65 (Useful Observation) Let A be m × n real matrix. Then both A⊤ A and AA⊤ are symmetric, and hence their eigen-
values are real.
Claim 3.66 The eigenvalues of A⊤ A and AA⊤ are non-negative (real numbers).
v ⊤ A⊤ Av = v ⊤ λv
⇕
⟨Av, Av⟩ = λ⟨v, v⟩
⇕
∥Av∥2 = λ∥v∥2 .
Definition 3.67 Let M be an n × n real matrix and x ∈ Rn . Then x⊤ M x is called a quadratic form.
Exercise 3.69 If W is skew symmetric, then x⊤ W x = 0 for all x ∈ Rn . (Hint: use the fact that a real number is equal to its
transpose to show that x⊤ W x = x⊤ W ⊤ x = −x⊤ W x.)
M + M⊤ M − M⊤
M= + .
2 } | {z
| {z 2 }
symmetric skew symmetric
M +M ⊤
Definition 3.71 2 is called the symmetric part of M .
M +M ⊤
Exercise 3.72 For any real square matrix M , x⊤ M x = x⊤ 2 x.
Remark 3.73 Consequently, when working with a quadratic form, one always assumes M is symmetric.
Proposition 3.74 (E-value Bounds of Symmetric Matrices) Let M be an n × n real symmetric matrix. Then ∀ x ∈ Rn ,
λmin x⊤ x ≤ x⊤ M x ≤ λmax x⊤ x,
Proof: By Proposition 3.59, we can choose an orthonormal basis v = {v 1 , v 2 , . . . , v n } of Rn consisting of eigenvectors of M . For
x ∈ Rn , we expand it in terms of the basis vectors as x = α1 v 1 + α2 v 2 + · · · + αn v n , and then using the fact that the basis is
orthonormal, we have that
Xn
x⊤ x = αi2 .
i=1
Next, we use the fact that the v are e-vectors so that M x = α1 λ1 v 1 + α2 λ2 v 2 + · · · + αn λn v n . Another straightforward calculation
i
gives that
X n
x⊤ M x = λi αi2 .
i=1
59
It follows that ! !
n
X n
X n
X
min{λ1 , . . . , λn } αi2 ≤ λi αi2 ≤ max{λ1 , . . . , λn } αi2
i=1 i=1 i=1
and hence
λmin x⊤ x ≤ x⊤ M x ≤ λmax x⊤ x.
■
Definition 3.75 A real symmetric matrix P is positive definite if (∀ x ∈ Rn , x ̸= 0) =⇒ (x⊤ P x > 0).
Notation 3.76 One writes P > 0 to indicated that P is positive definite. To be absolutely clear, P > 0 does not mean that all
entries of P are positive!
Theorem 3.77 (E-value Test for a Symmetric Matrix) A symmetric real matrix P is positive definite if, and only if, all of its
eigenvalues are strictly greater than zero.
Proof: From Proposition 3.74, if λmin > 0, then for all 0 ̸= x ∈ Rn , x⊤ P x > 0 and hence P is positive definite. For the other
direction, suppose that 0 ̸= v ∈ Rn satisfies P v = λv and λ ≤ 0. Then v ⊤ P v = λv ⊤ v = λ∥v∥2 ≤ 0, and hence P is not positive
definite. ■.
2 −1 1 2
Exercise 3.78 Show Pa := > 0 and Pb := is not positive definite. (Hint: Compute their e-values.)
−1 2 2 1
• P ≥ 0 =⇒ its diagonal entries are non-negative. (Hint: Try x with a single one and the remaining entries all zero.)
• The above conditions are necessary but not sufficient for P to be positive semi-definite and positive definite, respectively.
Theorem 3.81 (E-value Test for P ≥ 0) P is positive semidefinite if, and only if, all of its eigenvalues are non-negative.
Proof: Follow the proof of Theorem 3.77 mutatis mutandis, that is, follow the proof and make the necessary small adjustments. ■
Remark 3.84 There are several notions of a square root of a matrix. We are using the simplest one. From Wikipedia (https://
en.wikipedia.org/wiki/Square_root_of_a_matrix) In mathematics, the square root of a matrix extends the notion
of square root from numbers to matrices. A matrix B is said to be a square root of A if the matrix product BB is equal to A. Some
authors use the name square root or the notation A1/2 only for the specific case when A is positive semidefinite, to denote the unique
matrix B that is positive semidefinite and such that BB = B ⊤ B = A for real-valued matrices. Less frequently, the name square
root may be used for any factorization of a positive semidefinite matrix A as B ⊤ B = A, as in the Cholesky factorization, even if
BB ̸= A. We are doing the latter.
Proof:
60
2. Now suppose P ≥ 0. To show: ∃ N such that N ⊤ N = P .
Since P is real and symmetric, there exists an orthogonal matrix O such that
P = O⊤ ΛO
so that
(Λ1/2 )⊤ Λ1/2 = Λ1/2 Λ1/2 = Λ.
Let N = Λ1/2 O, then
⊤
N ⊤ N = O⊤ Λ1/2 Λ1/2 O = O⊤ ΛO = P.
Therefore N ⊤ N = P . ■
(x + y)⊤ P (x + y) = x⊤ P x + y ⊤ P y + 2x⊤ P y.
Theorem 3.87 (Schur Complements) Suppose that A, B and C are real matrices, A = n × n is symmetric, B = n × m, and
C = m × m is symmetric. Then for the symmetric matrix
A B
M= ,
B⊤ C
Proof: We will show (a) ⇐⇒ (b). The proof of (a) ⇐⇒ (c) is identical. First, let’s show (a) =⇒ (b). Suppose M > 0, Then
for all x ∈ Rn , x ̸= 0,
⊤
x x
M > 0.
0 0
Expanding this out gives
⊤
x A B x Ax
= x⊤ = x⊤ Ax
0< 0
0 B⊤ C 0 B⊤x
and therefore A is positive definite. We will make a nice choice of x and y to show C − B ⊤ A−1 B > 0. Let y ̸= 0 be otherwise
arbitrary and suppose we choose x such that Ax + By = 0. This is motivated by zeroing the top row of M when multiplying by
⊤
the vector x⊤ y ⊤ . Such a choice of x is always possible because we know that A > 0, which implies A is invertible and hence
Hence, (y ∈ Rm , y ̸= 0), implies y ⊤ (C − B ⊤ A−1 B)y > 0, and therefore C − B ⊤ A−1 B > 0.
61
the proof, we show (b) =⇒ (a). Hence,
To complete we suppose
A > 0, C−B ⊤ A−1 B > 0 and seek to show M > 0. For an
x x 0 x̄ 0
arbitrary , define x̄ = x + A−1 By and note that ̸= ⇐⇒ ̸= . Substituting in and applying Exercise 3.86 yields
y y 0 y 0
⊤ ⊤
x̄ − A−1 By x̄ − A−1 By
x x
M = M
y y y y
⊤ ⊤ ⊤
−A−1 By −A−1 By −A−1 By
x̄ x̄ x̄
= M + M +2 M
0 0 y y 0 y
| {z } | {z } | {z }
x̄⊤ Ax̄ y ⊤ (C−B ⊤ A−1 B)y 0
Definition 3.88 C − B ⊤ A−1 B is the Schur Complement of A in M and A − BC −1 B ⊤ is the Schur Complement of C in M .
to be positive definite.
Example 3.90 Test whether the given matrices are positive definite or not.
α 1 1
3 −2 2 3
M1 = , M2 = , and M3 = 1 2 1 ,
−2 3 3 2
1 1 3
where α ∈ R.
Solution: For M1 , we can apply the result of Example 3.89 to determine that a = 2 > 0 and det(M1 ) = 5 > 0, and hence M1 > 0.
Doing the same for M2 , we have that a = 2 > 0 and det(M2 ) = −5 < 0 and hence M2 is not positive definite.
and apply condition (c) of Theorem 3.87. From the result in Example 3.89 we quickly see that C > 0. Hence, we next form
−1 ⊤
−1 1
A − BC B = α − 1 1 C = α − 0.6.
1
62
What if we use the same decomposition of M and apply condition (b) of Theorem 3.87? Then we test
A > 0 ⇐⇒ α > 0
1 1
⊤ −1 2− 1−
C − B A B > 0 ⇐⇒ α
1
α
1 > 0.
1− α 3− α
It’s more complicated to work out, but by applying Example 3.89 again, we end up with three conditions
α > 0 ⇐⇒ α ∈ (0, ∞)
1
2 − > 0 =⇒ (α < 0) ∨ (α > 1/2) ⇐⇒ α ∈ (−∞, 0) ∪ (1/2, ∞)
α
(5α − 3)/α > 0 =⇒ (α < 0) ∨ (α > 3/5) ⇐⇒ α ∈ (−∞, 0) ∪ (3/5, ∞),
which must be simultaneously true1 . All of these conditions are met for α > 0.6, which agrees with our previous computation.
The condition obtained for α will be the same. The question is, which partition of M3 and which condition of Theorem 3.87 yield
the easiest test (for a particular problem)?
■
Proposition 3.91 (Weighted Least Squares, Overdetermined Equations). Let S be an n × n positive definite matrix (S > 0) and
let the inner product on Rn be
⟨x, y⟩ := x⊤ Sy,
so that ∥x∥2 = ⟨x, x⟩ = x⊤ Sx. Consider the overdetermined equation Aα = b, where A = n × m, n ≥ m, rank(A) = m, α ∈
Rm , and b ∈ Rn . Then
x
b=α b 2 A2 + · · · + α
b 1 A1 + α b m Am
⊤
b = β, with G = G⊤
G α
[G⊤ ]ij = [G]ij = ⟨Ai , Aj ⟩ = A⊤ ⊤
i SAj = [A SA]ij
βi = ⟨b, Ai ⟩ = b⊤ SAi = A⊤ ⊤
i Sb = [A Sb]i
⇕
⊤
α = A⊤ Sb.
A SAb
Because rank(A) = m implies the columns of A are linearly independent, we know that the Gram matrix A⊤ SA is invertible.
Hence,
b = (A⊤ SA)−1 A⊤ Sb.
α
■
1 A great way to think about it is M > 0 if, and only if, α ∈ S ∩ S ∩ S , where S = (0, ∞), S = (−∞, 0) ∪ (1/2, ∞), and S = (−∞, 0) ∪ (3/5, ∞).
3 1 2 3 1 2 3
Calculation of the indicated set intersections yields S1 ∩ S2 ∩ S3 = (3/5, ∞).
63
Proposition 3.92 (Recursive Least Squares) Consider the model
yi = Ci x + ei , i = 1, 2, 3, · · ·
Ci ∈ Rm×n
Model: i = time index (3.9)
x = an unknown constant vector ∈ Rn
yi = measurements ∈ Rm
ei = model “mismatch” ∈ Rm
for generating the data stream {y1 , y2 , . . .}. Let k0 be the smallest k ≥ 1 such that rank([C1⊤ C2⊤ · · · Ck⊤0 ]) = n. Then, for all
k ≥ k0 , a solution to
k
!
X
⊤
x
bk : = arg min (yi − Ci x) Si (yi − Ci x)
x∈Rn i=1
k
!
X
= arg min e⊤
i Si ei
x∈Rn i=1
where Si = m × m positive definite matrix (Si > 0 for all time indices i) can be computed recursively by
⊤
x
bk+1 = x
bk + Pk+1 Ck+1 Sk+1 (yk+1 − Ck+1 x
bk ) .
| {z }| {z }
“Kalman gain” “Innovations”
⊤ ⊤ −1 −1
Pk+1 = Pk − Pk Ck+1 Ck+1 Pk Ck+1 + Sk+1 Ck+1 Pk
"k #−1
X0
Pk0 := Ci⊤ Si Ci
i=1
Proof:
Step 1 is a Batch Solution that can be used for initialization: For k ≥ 1, define
y1 C1 e1 S1
y2 C2 e2 0
S2
Yk = . , Ak = . , Ek = . , and Rk = = diag(S1 , S2 , · · · , Sk ) > 0.
.. .. .. ..
0 .
yk Ck ek Sk
Then Yk satisfies Yk = Ak x + Ek , k ≥ 1.
Claim 3.93 Suppose rank(Ak0 ) = n, the dimension of x. Then, for all k ≥ k0 , rank(Ak ) = n.
Next we note that ∥Yk − Ak x∥2 = (Yk − Ak x)⊤ Rk (Yk − Ak x), and thus, by Proposition 3.91, for all k ≥ k0 ,
bk = (A⊤
x k Rk Ak )
−1 ⊤
Ak Rk Yk ,
64
because all of the measurements are processed together as a batch to best approximate x. The drawback is that Ak is a km × n
matrix, and grows at each step. On a real robot, if you were collecting data at a KHz or more, you’d quickly fill your memory and
forming A⊤k Rk Ak would take longer and longer at each step. The solution is to find a recursive means to compute xbk+1 in terms of
x
bk and the new measurement yk+1 .
k
! k
X X
Ci⊤ Si Ci x
bk = Ci⊤ Si yi .
i=1 i=1
We define
k
X
Qk = Ci⊤ Si Ci
i=1
so that
⊤
Qk+1 = Qk + Ck+1 Sk+1 Ck+1 .
And the normal equations at time k + 1 become,
k+1
X k+1
X
( Ci⊤ Si Ci ) x
bk+1 = Ci⊤ Si yi
i=1 i=1
| {z }
Qk+1
or
k
X
Qk+1 x
bk+1 = Ci⊤ Si yi +Ck+1
⊤
Sk+1 yk+1 .
i=1
| {z }
Qk x
bk
⊤
Qk+1 = Qk + Ck+1 Sk+1 Ck+1
⊤
Qk+1 x
bk+1 = Qk x
bk + Ck+1 Sk+1 yk+1 .
The estimate at time k + 1 is expressed as a linear combination of the estimate at time k and the latest measurement at time k + 1.
Continuing,
bk+1 = Q−1 ⊤
x k+1 Qk x
bk + Ck+1 Sk+1 yk+1 .
⊤
Because Qk = Qk+1 − Ck+1 Sk+1 Ck+1 , we have
x bk + Q−1
bk+1 = x ⊤
k+1 Ck+1 Sk+1 (yk+1 − Ck+1 x
bk ) .
| {z }| {z }
Kalman gain Innovations
The term (yk+1 − Ck+1 x bk ) is called the innovations because it is effectively the “new information” provided by the measurement
at time k + 1. If there were no measurement, we could predict an estimate for y at time k + 1 by ybk+1 := Ck+1 x bk (recall that our
model assumes that x is a constant vector, hence if we have an estimate of x at time step k, our best guess for it at time k + 1, absent
any further measurements, would simply be that it is unchanged from our previous estimate). The innovation is
the difference between what we measure and what we predict. If our prediction does not change with yk+1 , then the measurement
was not “innovative”. The appellation “Kalman Gain” does not have any grounding at this point. But when we treat the Kalman filter
a bit later in these notes, you’ll see that the Kalman filter has a measurement update gain that looks just like the one in RLS (recursive
least squares).
65
In a real-time implementation, computing the inverse of Qk+1 can be time consuming. An attractive alternative can be obtained by
applying the Matrix Inversion Lemma,
−1 −1
(A + BCD) = A−1 − A−1 B DA−1 B + C −1 DA−1
we have
⊤ ⊤ −1 −1
Pk+1 = Pk − Pk Ck+1 Ck+1 Pk Ck+1 + Sk+1 Ck+1 Pk
We note that we are now inverting a matrix that is m × m, instead of one that is n × n. Typically, n >> m (means very much greater
than), and thus the savings can be important.
■
Remark 3.94 (Recursive least squares (RLS) with a “forgetting factor”) is treated in HW #06. The forgetting factor allows you to
exponentially “discount” older measurements. This is important if the paramter you are trying to estimate is slowly drifting.
• the solution is unique if, and only if, the columns of A are linearly independent; and thus
• if there exists one solution and the columns of A are linearly dependent, then there exist an infinite number of solutions.
Underdetermined Equations
The columns of A will be linearly dependent when Ax = b has fewer equations than unknowns. In other words, A is n × m
and m > n; sometimes these are called wide matrices: more columns than rows. When dealing with an equation Ax = b
with fewer equations than unknowns, one says that it is underdetermined. Why? Because, to determine x uniquely, at a
minimum, we need as many equations as unknowns.
Is there a difference between being underdetermined and having an infinite number of solutions? Yes. It’s possible to
be underdetermined and have no solution at all when b ̸∈ col span{A}. If the rows of A are linearly independent, then
The rows of A being linearly independent is equivalent to the columns of A⊤ being linearly independent.
When Ax = b has an infinite number of solutions, is there a way that we can make one of them appear to be more interesting, more
special, or just flat out “better” than all the other solutions? Is there a property that we could associate with each solution and op-
timize our choice of solution with respect to that property? The most common approach is to choose the solution with minimum norm!
Proposition 3.95 (Underdetermined Equations) Consider the real finite dimensional inner product space (Rn , R, ⟨•, •⟩) where <
x, z >:= x⊤ Sz and S > 0. If the rows of A are linearly independent, then
−1
b = S −1 A⊤ β, AS −1 A⊤ β = b or, equivalently, x
satisfies x b = S −1 A⊤ AS −1 A⊤ b.
66
Proof: The main idea is to express the constraint Ax = b in terms of the inner product, which means to identify vectors {v 1 , . . . , v p }
and constants c1 , . . . , cp such that Ax = b ⇐⇒ ⟨v i , x⟩ = ci , 1 ≤ i ≤ p, so that Theorem 3.51 is applicable. To identify the required
vectors and constants, we partition A by rows, that is
a1
..
A =: . ,
ap
and note that Ax = b ⇐⇒ AS −1 Sx = b. Then, based upon the row by row interpretation of AS −1 Sx = b, we define vi ∈ Rn by
(vi )⊤ := ai S −1 ⇐⇒ vi = (ai S −1 )⊤ = S −1 a⊤
i ,
where G⊤ β = b and G is the Gram matrix. The last part is to show that
v1 v2 · · · vp = S −1 a⊤ S −1 a⊤ · · · S −1 a⊤ = S −1 a⊤ a⊤ a⊤ = S −1 A⊤
1 2 p 1 2 ··· p
■
Examples of working with these results are included in the HW sets.
67
68
Chapter 4
Learning Objectives
• Introduce the notion of factoring a matrix as a means of solving systems of linear equations
• See important applications of the work we have done in Chapters 2 and 3.
Outcomes
• Learn how to compute and use a QR Factorization.
• Understand that the theoretical definition of linear independence may not be adequate for engineering practice.
• Learn about the Singular Value Decomposition, a workhorse in numerical linear algebra, that addresses the above issue and
much more.
• Learn the LU factorization, the LDLT (or Cholesky) Factorization, and their uses.
69
4.1 QR Factorization
Definition 4.1 An n × m matrix R is upper triangular if Rij = 0 for all i > j.
Theorem 4.2 (QR Decomposition or Factorization) Let A be a real n × m matrix with linearly independent columns. Then there
exist an n × m matrix Q with orthonormal columns and an m × m upper triangular matrix R such that
A = QR.
Remark 4.3
1. Q⊤ Q = Im×m
r11 r12 ··· r1(m−1) r1m
0 r22 ··· r2(m−1) r2m
2. R =
0 0 r33 ··· r3m
.. .. .. .. ..
. . . . .
0 ··· ··· 0 rmm
Proof: The proof is organized around the computation of the factorization by the Gram-Schmidt Algorithm with Normalization.
Partition A into columns, A = A1 A2 · · · Am , Ai ∈ Rn , and use the inner product ⟨x, y⟩ = x⊤ y. For 1 ≤ k ≤ n,
{A1 , A2 , · · · , Am } → {v1 , v2 , · · · , vm } by
for k = 1 : m
v k = Ak
for j = 1 : k − 1
v k = v k − ⟨Ak , v j ⟩v j
end
k
v k = ∥vvk ∥
end
By construction, Q := v 1 v2 vm has orthonormal columns, and hence Q⊤ Q = Im×m because [Q⊤ Q]ij = ⟨v i , v j ⟩ =
···
1, i = j and zero otherwise.
⟨A1 , v 1 ⟩
..
.
⟨Ai , v i ⟩
Ri :=
,
0
..
.
0
where Rij = 0 for i < j ≤ n. The coefficients in R can be extracted directly from the Gram-Schmdit Algorithm; no extra computa-
tions are required. By construction, Ai = QRi and thus we have A = QR.
Example 4.4 (QR Decomposition of Overdetermined Equations) Suppose that Ax = b is overdetermined with columns of A linearly
70
independent. Write A = QR and consider
A⊤ Ab
x = A⊤ b
⇕
⊤ ⊤
x = R⊤ Q⊤ b
R Q QRb
⇕
⊤
x = R⊤ Q⊤ b
R Rb (because Q⊤ Q = I)
⇕
x = Q⊤ b
Rb (because R is invertible)
and therefore, x
b3 to x
b1 can be obtained easily without performing a matrix inverse.
Example 4.5 (QR Decomposition of Underdetermined Equations) Suppose Ax = b is underdetermined with rows of A linearly
independent. For the inner product ⟨x, y⟩ = x⊤ y, x
b = A⊤ (AA⊤ )−1 b is value of x of smallest norm satisfying Ax = b.
A⊤ has linearly independent columns, and hence we write A⊤ = QR, Q⊤ Q = I, R is upper triangular and invertible. It follows
that
b = A⊤ (AA⊤ )−1 b
x
⇕
−1
b = QR R⊤ Q⊤ QR
x b
⇕
b = QR(R⊤ R)−1 b
x
⇕
b = QRR−1 (R⊤ )−1 b
x
⇕
b = Q(R⊤ )−1 b.
x
4.2.1 Motivation
In abstract linear algebra, a set of vectors is either linearly independent or not. There is nothing in between. For example, the set of
vectors
1 0.999
v1 = , v2 =
1 1
is linearly independent. In this case, one looks at the set of vectors and says, yes, BUT, the vectors are “almost” dependent because
when one computes the determinant
1 0.999
det = 0.001,
1 1
the result is pretty small, so it should be fine to call them dependent.
71
Well, what about the set
104
1
v1 = , v2 = ?
0 1
When you form the matrix and check the determinant, you get
104
1
det = 1,
0 1
which seems pretty far from zero. So are these vectors “adequately” linearly independent?
Σij = 0 for i ̸= j.
The diagonal of Σ is the set of all Σii , 1 ≤ i ≤ min(n, m). An alternative and equivalent way to define a Rectangular Diagonal
Matrix is
Σd
(a) (tall matrix) n > m Σ = , where Σd is an m × m diagonal matrix.
0
(b) (wide matrix) n < m Σ = Σd 0 , where Σd is an n × n diagonal matrix.
Theorem 4.7 (SVD or Singular Value Decomposition) Every n × m real matrix A can be factored as
A = U · Σ · V ⊤,
where U is an n × n orthogonal matrix, V is an m × m orthogonal matrix, Σ is an n × m rectangular diagonal matrix, and the
diagonal of Σ,
diag(Σ) = [σ1 , σ2 , · · · , σp ] ,
satisfies σ1 ≥ σ2 ≥ · · · ≥ σp ≥ 0, for p := min(n, m). Moreover, the columns of U are eigenvectors of A · A⊤ , the columns of
V are eigenvectors of A⊤ · A, and {σ12 , σ22 , . . . , σp2 } are eigenvalues of both A⊤ · A and A · A⊤ . The Singular Values of A are the
elements {σ1 , . . . , σp } from the diagonal of Σ.
Proof: A⊤ A is m × m, real, and symmetric. Hence, there exists a set of orthonormal eigenvectors {v 1 , . . . , v m } such that
A⊤ Av j = λj v j .
Without loss of generality, we can assume that λ1 ≥ λ2 ≥ · · · ≥ λm ≥ 0. If not, we simply re-order the v i ’s to make it so. For
λj > 0, say 1 ≤ j ≤ r, we define
p
σj = λj
and
1
qj = Av j ∈ Rn
σj
72
(
i ⊤
1 i=j
Claim 4.9 For 1 ≤ i, j ≤ r, q q j = δij = . That is, the vectors {q 1 , q 2 , . . . , q r } are orthonormal.
0 i ̸= j
Proof of Claim:
⊤ 1 1 ⊤
qi qj = v i A⊤ Av j
σi σj
λj ⊤
= vi vj
σi σj
(
λi
2 i=j
= (σi )
0 i ̸= j
(
1 i=j
=
0 i ̸= j
□
Claim 4.10 The vectors {q 1 , q 2 , . . . , q r } are eigenvectors of AA⊤ and the corresponding e-values are {λ1 , λ2 , . . . , λr }.
Proof of Claim: For 1 ≤ i ≤ r, λi > 0 and
1
AA⊤ q i :=AA⊤ Av i
σi
1
= A A⊤ A v i
σi
λi
= Av i
σi
=λi q i ,
and thus q i is an e-vector of AA⊤ with e-value λi . The claim is also an immediate consequence of Lemma 2.63. □
From Fact 2.61, if r < n, then the remaining e-values of AA⊤ are all zero. Moreover, we can extend the q i ’s to an orthonormal basis
for Rn satisfying AA⊤ q i = 0, for r + 1 ≤ i ≤ n. Define
U := q 1 q 2 · · · q n and V := v 1 v 2 · · · v m .
Also, define Σ = n × m by (
σi δij 1 ≤ i, j ≤ r
Σij =
0 otherwise.
Then, Σ is rectangular diagonal with
diag (Σ) = [σ1 , σ2 , · · · , σr , 0, · · · , 0]
To complete the proof of the theorem, it is enough to show1 that U ⊤ AV = Σ. We note that the ij element of this matrix is
(U ⊤ AV )ij = qi⊤ Av j
If j > r, then A⊤ Av j = 0, and thus (q i )⊤ Av j = 0, as required. If i > r, then q i was selected to be orthogonal to
1 1 1
{q 1 , · · · , q r } = { Av 1 , Av 2 , · · · , Av r }
σ1 σ2 σr
⊤
and thus q i Av j = 0. Hence we now consider 1 ≤ i, j ≤ r and compute that
1 i ⊤ ⊤ j
U ⊤ AV
ij
= v A Av
σi
λj i ⊤ j
= v v
σi
= σi δij
1 Because U ⊤ U = I and V ⊤ V = I, it follows that A = U ΣV ⊤ ⇐⇒ U ⊤ AV = Σ.
73
as required.
■
This formula follows from Fact2.67, matrix multiplication based on the sum over columns times rows, where we note that the columns
of V are the rows of V ⊤ .
Example 4.12 Determine the SVD of A as well as its rank and nullity,
1 104
A= .
0 1
There are two non-zero singular values, and thus r = 2. It follows that rank(A) = 2 and nullity(A) = 0.
Information about the “near” linear dependence of the columns of A is in the diagonal matrix Σ. There are two singular values,
σ1 = 104 and σ2 = 10−4 . Their ratio is 108 , which is an indicator that these vectors are “nearly linearly dependent”. “Numerically”,
one would say that r = 1 and hence rank(A) = r = 1 and nullity(A) = 2 − r = 1. ■
1 using LinearAlgebra
2
9 (U ,Sigma, V) = svd(A)
one obtains
74
−2.475e − 01 −5.600e − 01 4.131e − 01 5.759e − 01 3.504e − 01
−3.542e − 01 −5.207e − 01 −7.577e − 01 −1.106e − 02 −1.707e − 01
U =
−4.641e − 01 6.013e − 01 −1.679e − 01 6.063e − 01 −1.652e − 01
−5.475e − 01 −1.183e − 01 4.755e − 01 −3.314e − 01 −5.919e − 01
−5.460e − 01 1.992e − 01 −2.983e − 02 −4.369e − 01 6.859e − 01
1.325e + 02 0.000e + 00 0.000e + 00 0.000e + 00 0.000e + 00
0.000e + 00 3.771e + 01 0.000e + 00 0.000e + 00 0.000e + 00
Σ= 0.000e + 00 0.000e + 00 3.342e + 01 0.000e + 00 0.000e + 00
0.000e + 00 0.000e + 00 0.000e + 00 1.934e + 01 0.000e + 00
0.000e + 00 0.000e + 00 0.000e + 00 0.000e + 00 7.916e − 01
4.307e − 01 8.839e − 01 −5.303e − 02 8.843e − 02 −1.503e − 01
4.309e − 01 −2.207e − 01 −1.961e − 01 7.322e − 01 4.370e − 01
4.617e − 01
V = −8.902e − 02 7.467e − 01 −3.098e − 01 3.539e − 01
4.730e − 01 −3.701e − 01 7.976e − 02 1.023e − 01 −7.890e − 01
4.380e − 01 −1.585e − 01 −6.283e − 01 −5.913e − 01 1.968e − 01
Because the smallest singular value σ5 = 0.7916 is less than 1% of the largest singular value σ1 = 132.5, in many cases, one
would say that the numerical rank of A was 4 instead of 5.
This notion of numerical rank can be formalized by asking the following question: Suppose rank(A) = r. How far away is A
from a matrix of rank strictly less than r?
The numerical rank of a matrix is based on the expansion in (4.11), which is repeated here for convenience,
p
X
A=U ·Σ·V⊤ = σi ui · vi⊤ = σ1 u1 · v1⊤ + σ2 u2 · v2⊤ + · · · + σp up · vp⊤ ,
i=1
where p = min{m, n}, and once again, the singular values are ordered such that σ1 ≥ σ2 ≥ · · · ≥ σp ≥ 0. Each term ui · vi⊤ is a
rank-one matrix. The following will help you understand the expansion.
Exercises or Facts:
Pp
• A · A ⊤ = U · Σ · Σ⊤ · U ⊤ = i=1 (σi )
2
ui · u⊤
i
Pp
• A⊤ · A = V · Σ⊤ · Σ · V ⊤ = i=1 (σi )2 vi · vi⊤
(
⊤ ui j = i
and hence rank(ui · vi⊤ ) = 1 and nullity(ui · vi⊤ ) = m − 1
• ui · vi · vj =
0 j ̸= i
(
⊤ ui j = i
and hence rank(ui · u⊤ ⊤
• ui · ui · uj = i ) = 1 and nullity(ui · ui ) = n − 1
0 j ̸= i
(
⊤ vi j = i
and hence rank(vi · vi⊤ ) = 1 and nullity(vi · vi⊤ ) = m − 1
• vi · vi · vj =
0 j ̸= i
• vi · vi⊤ , and ui · u⊤
i have e-values λ1 = 1 distinct and λ2 = 0 repeated m − 1 and n − 1 times, respectively.
(
ui j = i
• Hint: ui · vi⊤ · vj = ui · vi⊤ · vj =
because the {v1 , v2 , . . . , vm } are orthonormal.
0 j ̸= i
So far, we have only defined the norm of a vector. However, it is also useful to measure the “length” of matrices.
75
Definition 4.13 (Induced Matrix Norm) Given an n × m real matrix A, the matrix norm induced by the Euclidean vector norm is
given by:
q
||A|| := max ||Ax|| = λmax (A⊤ A)
x⊤ x=1
where λmax (A⊤ A) denotes the largest eigenvalue of the matrix A⊤ A. (Recall that the matrices of the form A⊤ A are at least
positive semidefinite and hence their e-values are real and non-negative. Therefore, the square root exists.)
Numerical Rank
Facts: Suppose that rank(A) = r, so that σr is the smallest non-zero singular value of A.
Corollary: Suppose A is square and invertible. Then σr measures the distance from A to the nearest singular matrix.
Illustration Continued
4.169e − 02 −1.212e − 01 −9.818e − 02 2.189e − 01 −5.458e − 02
−2.031e − 02 5.906e − 02 4.784e − 02 −1.066e − 01 2.659e − 02
E=
−1.966e − 02 5.716e − 02 4.629e − 02 −1.032e − 01 2.574e − 02
−7.041e − 02 2.048e − 01 1.658e − 01 −3.697e − 01 9.220e − 02
8.160e − 02 −2.373e − 01 −1.922e − 01 4.284e − 01 −1.068e − 01
7.376e − 09
q
2.406e − 09
⊤
λi (E · E) = 1.977e − 09
4.163e − 09
7.916e − 01
1.325e + 02 0.000e + 00 0.000e + 00 0.000e + 00 0.000e + 00
0.000e + 00 3.771e + 01 0.000e + 00 0.000e + 00 0.000e + 00
Σ2 =
0.000e + 00 0.000e + 00 3.342e + 01 0.000e + 00 0.000e + 00
0.000e + 00 0.000e + 00 0.000e + 00 1.934e + 01 0.000e + 00
0.000e + 00 0.000e + 00 0.000e + 00 0.000e + 00 1.775e − 15
We added a matrix with norm 0.7916 and made the (exact) rank drop from 4 to 5! How cool is that? This example shows that SVD
can exactly measure how close a matrix is to being singular. We also see that E ⊤ · E has rank one: there is one non-zero e-value and
76
the rest are (essentially) zero as the theory promised.
(c) Fact: Suppose A = U · Σ · V ⊤ . Then the columns of U corresponding to non-zero singular values are a basis for
range(A) and the columns of V corresponding to zero singular values are a basis for null(A), viz
(d) The SVD can also be used to compute an “effective” range and an “effective” null space of a matrix.
(e) Fact: Suppose that σ1 ≥ ... ≥ σr > δ ≥ σr+1 ≥ ...σp ≥ 0, so that r is the “effective” or “numerical rank” of A.
(Note the δ inserted between σr and σr+1 to denote the break point.)
(f) Fact: Let rangeeff (A) and nulleff (A) denote the effective range and effective null space of A, respectively. Then we
can calculate bases for these subspaces by choosing appropriate singular vectors:
Definition 4.14 An n × n matrix P consisting of only zeros and ones and satisfying P ⊤ P = P P ⊤ = I is called a permutation
matrix.
Exercise 4.15 A permutation matrix can be viewed in two ways: (a) as a permutation of the rows of an identity matrix; or (b) as a
permutation of the columns of an identity matrix. Hence, it is common and useful to identify an n × n permutation matrix P with a
list of indices p = {i1 , i2 , . . . , in } formed by permuting the list {1, 2, . . . , n}. Show the following:
• Each row and each column of a permutation matrix has exactly one 1.
⊤
• Let x ∈ Rn , I be the n×n identity matrix, and define P := I[p, :], a row permutation of I. Then P x = x i1 x i2 ··· x in .
⊤
If P := I[:, p], a column permutation of I, then x P = xi1 xi2 · · · xin .
• Every n × n permutation matrix can be written as P := I[p, :] and as P := I[:, pe] for appropriate permutations p and pe of the
list {1, 2, . . . , n}.
• Multiplying a matrix A on the left by a permutation matrix (of appropriate size) permutes the order of its rows, while multiplying
it on the right permutes the order of its columns.
Definition 4.16 A possibly rectangular matrix L is lower triangular if all entries above the diagonal are zero. A possibly rectangular
matrix U is upper triangular if all entries below the diagonal are zero. Recall that the diagonal of an n × m matrix M consists
of all entries of the form mii , 1 ≤ i ≤ min{n, m}. L is uni-lower triangular if its diagonal consists of all ones. By a slight abuse
of terminology, we’ll allow an empty matrix to be called uni-lower triangular because its diagonal, being empty, has no terms that
violate the definition.
77
• If M is square and either upper or lower triangular, then its determinant is given by the product of the terms on its diagonal.
• If the lower triangular matrix L is square and has non-zero determinant, then the equation Lx = b can be solved by forward
substitution; see the ROB 101 textbook.
• If the upper triangular matrix U is square and has non-zero determinant, then the equation U x = b can be solved by back
substitution; see the ROB 101 textbook.
As a lead in to LU Factorization, you can check that if L is a lower triangular matrix and U is an upper triangular matrix, then in
general, their product A := LU is neither. Can this process be reversed? That is, given a generic square matrix, can it be factored
as the product of a lower-triangular matrix and an upper-triangular matrix? And if we can do such a factorization, would it be helpful?
Example 4.18 The goal here is to show you the secret sauce the underlies a very nice method for constructing the required triangular
matrices. We call it peeling the onion: starting from the top left corner and working down the diagonal, it successively zeros out
columns and rows of a matrix! Consider the square matrix
1 4 5
M = 2 9 17 .
3 18 58
Our goal is to find a column vector C1 and a row vector R1 such that
0 0 0
M − C 1 · R1 = 0 ∗ ∗ ,
0 ∗ ∗
where ∗ denotes “don’t care” in the sense that we do not care about their particular values. We want to zero out the first column and
the first row of M . That means, C1 and R1 are chosen so that the first column and first row of their matrix product C1 · R1 match the
first column and first row of M . How can you do that?
We perform a special case of “peeling the onion” that works when the top left entry equals 1.0. The general case will be treated later.
We define C1 and R1 to be the first column of M and the first row of M , respectively, that is
1
C1 = 2 and R1 = 1 4 5 .
3
Then
1 1 4 5
C1 · R1 = 2 · 1 4 5 = 2 8 10 ,
3 3 12 15
and voilà,
1 4 5 1 4 5
M = 2 9 17 and C1 · R1 = 2 8 10 .
3 18 58 3 12 15
Consequently,
1 4 5 1
M − C1 · R1 = 2 9 17 − 2 · 1 4 5
3 18 58 3
1 4 5 1 4 5
= 2 9 17 − 2 8 10
3 18 58 3 12 15
0 0 0
= 0 1 7 .
0 6 43
78
Oh! We have taken a 3 × 3 matrix and essentially made it into a 2 × 2 matrix!! Can we do this again? Let’s try. We define C2 and
R2 to be the second column and second row of M − C1 · R1 , that is
0
C2 = 1 and R2 = 0 1 7 .
6
Oh! Now we are essentially down to a 1 × 1 matrix!! You might be seeing the pattern! We very quickly note that if we define C3 and
R3 to be the third column and third row of M − C1 · R1 − C2 · R2 ,
0
C3 = 0 and R3 = 0 0 1 ,
1
then
0 0 0
C 3 · R3 = 0 0 0 ,
0 0 1
and hence, M − C1 · R1 − C2 · R2 − C3 · R3 = 03×3 . We prefer to write this as
R1
M = C1 · R1 + C2 · R2 + C3 · R3 = C1 C2 C3 · R2 .
| {z
L
} R3
| {z }
U
Moreover,
1 0 0
• L := C1 C2 C3 = 2 1 0 is uni-lower triangular,
3 6 1
R1 1 4 5
• U := R2 = 0 1 7 is upper triangular, and
R3 0 0 1
79
In the example, we arranged that the first non-zero entry of Ci was equal to one. This is a particularly special case chosen to illustrate
that, at least in some cases, one can systematically factor a matrix. We now build toward the general case.
Definition 4.19 A n × m matrix A is left zeroed of order 0 ≤ k ≤ min{n, m} if it has the form
" #
0k×k 0k×(m−k)
A= .
0(n−k)×k Ae
Remark 4.20 When k = 0, the matrices 0k×k , 0k×(m−k) , and 0(n−k)×k are empty. Hence, every matrix is left-zeroed of order zero.
When k = min{n, m}, then A
e is the empty matrix and hence A is identically zero.
Lemma 4.21 (Peeling the Onion) Suppose that A is an n × m left-zeroed matrix of order 0 ≤ k < min{n, m}. Then there exist a
permutation matrix P , a column vector C, and a row vector R such that
(a) P A − CR is left-zeroed of order (k + 1),
(c) the (k + 1)-st entry of C equals one (i.e., [C]k+1 = 1.0), and
(d) the first k rows of P are equal to the first k rows of the identity matrix.
Proof: We use proof by exhaustion to cover three cases. We let [A]ij and aij denote the ij entry of A, and arow
i and acol
j denote the
i-th row and j-th column, respectively.
Case 1: Suppose ak+1,k+1 ̸= 0. Then we define P = I, the identity matrix, C := acol row
k+1 /ak+1,k+1 , and R := ak+1 . Note that the
column C has been normalized by ak+1,k+1 , the first non-zero entry of C, and that ak+1,k+1 is also the first non-zero entry of R.
Moreover, by construction, the first k entries of C and R are zero. Based on these observations, we leave it to the reader to check that
[C · R]ij = [P A]ij
for i = k + 1 and j ∈ {k + 1, . . . , m} and for j = k + 1, and i ∈ {k + 1, . . . , n}. Hence, P A − CR is left-zeroed of order (k + 1).
□
Case 2: Suppose acol
k+1 = 0n×1 . Then we define P = I, the identity matrix, the column vector C by
(
1 i = (k + 1)
[C]i := ,
0 otherwise
[C · R]ij = [P A]ij
for i = k + 1 and j ∈ {k + 1, . . . , m} and for j = k + 1, and i ∈ {k + 1, . . . , n}. Hence, P A − CR is left-zeroed of order (k + 1).
□
Case 3: Suppose ak+1,k+1 = 0 and acol k+1 ̸= 0n×1 so that there exists k + 1 < ρ ≤ n such that aρ,k+1 ̸= 0. Then we define p to be
the unique permutation of {1, 2, . . . , n} such that
ρ
i=k+1
pi = k + 1 i = ρ
i i ̸∈ {k + 1, ρ}.
Upon defining P := I[p, :], the matrix Ǎ := P A now satisfies the conditions of Case 1 and its proof can be followed mutatis
mutandis (i.e., by making the necessary simple changes). We thus leave the rest of the proof to the reader. □
■
Theorem 4.22 (LU Factorization) Let A be an n×m real matrix and define r = min(n, m). There always exist an n×n permutation
matrix P , an n × r uni-lower triangular matrix L, and an r × m upper triangular matrix U such that
P · A = L · U.
80
Proof: We use proof by induction. At the base step k = 0, we set P0 := I, and L0 , U0 empty matrices.
(d) the first k rows of P are equal to the first k rows of the identity matrix.
(i) Pk+1 := P · Pk ;
(ii) Lk+1 := P · Lk C ; and
Uk
(iii) Uk+1 := .
R
Then
Uk
Lk+1 · Uk+1 = P · Lk C · = P · Lk · Uk + C · R,
R
and therefore
Because Lk is an n × k uni-lower triangular and the first k rows of P are equal to the identity matrix, it follows that P · Lk is also
uni-lower triangular. Because the first k entries of C are equal to zero and its (k + 1)-st entry is one, it follows that [P · Lk C] is an
n × (k + 1) uni-lower triangular matrix. Because Uk is a k × m upper triangular matrix and the first k entries of R are equal to zero,
it follows that [Uk⊤ R⊤ ]⊤ is a (k + 1) × m upper triangular matrix.
The algorithm stops at step r = min{n, m}, producing the required matrices.
■
Ax = b ⇐⇒ P · Ax = P · b ⇐⇒ L · U x = P · b.
Ly = P · b (4.2)
U x = y. (4.3)
Furthermore,
(P · A = L · U ) =⇒ det(A) = ± det(L) det(U )
and A is invertible if, and only if, both L and U are invertible. Our solution strategy is therefore to solve (4.2) by forward
substitution, and then, once we have y in hand, we solve (4.3) by back substitution to find x, the solution to Ax = b.
81
Example 4.23 Use LU Factorization to solve the system of linear equations
−2 −4 −6 x1 2
−2 1 −4 x2 = 3 . (4.4)
−2 11 −4 x3 −7
| {z }| {z } | {z }
A x b
Even though A admits an LU Factorization without row permutations, Julia inserts a permutation matrix. This is to improve the
numerical accuracy on large problems. On our small problem, it’s not really needed. Nevertheless, we’ll use it to show that we obtain
the same answer with essentially the same amount of work.
We first compute
1.0 0.0 0.0 2.0 2.0
P b = 0.0 0.0 1.0 3.0 = −7.0 .
0.0 1.0 0.0 −7.0 3.0
And finally, we use this result to solve U x = y for x, using back substitution,
−2.000 −4.000 −6.000 x1 2.0 x1 −8.0
0.000 15.000 2.000 x2 = −9.0 =⇒ x2 = −1.0 .
0.000 0.000 1.333 x3 4.0 x3 3.0
| {z } | {z } | {z }
U x y
4.4 LDLT or Cholesky Factorization (LU specialized for Positive Semi-definite Matri-
ces)
Matrices that are positive semi-definite can be factored in a special form. We send the reader to the ROB 101 textbook for further
details. To preserve the symmetric nature of positive semi-definite matrices when doing the factorization, it is necessary to use both
row and column perturbations.
82
Enhanced LU Factorization and Rank of a Matrix
A real positive semi-definite matrix M always has an LDLT Factorization (aka, Cholesky Factorization)
P · M · P ⊤ = L · D · L⊤ , (4.6)
where
• P is a (row) permutation matrix;
• P ⊤ , the transpose of P , permutes the columns of M ;
Remark 4.24 Recall that permutations in the LU Factorization only arise in Case 3 of Lemma 4.21 on “peeling the onion”. Theo-
rem 3.87 on Schur complements can be used to show that if Case 2 or Case 3 ever occurs, then M is not positive definite. Hence, for
positive definite matrices, a simpler factorization is possible.
83
84
Chapter 5
Learning Objectives
• Learn enough probability that we gain a clear understanding of the Kalman Filter
• As foundations for the Kalman Filter, cover Best Linear Unbiased Estimation (BLUE) and Minimum Variance Estimation
(MVE)
Outcomes
• Understand why probability is the most technical topic we have covered so far.
• Understand the notion of conditional probability and how it relates to “fusing” measurements.
• Our capstone topic in the Chapter is the Kalman filter, a recursive form of Minimum Variance Estimation.
5.1 Introduction
5.1.1 Intuition
Why use probability, much less even worry about what it means or how to use it properly? Let’s back up first and ask why do we use
mathematical models in Robotics, or in engineering?
When designing a system, it is common to combine mathematical models of individual components to predict the overall perfor-
mance of the system, assuming known or hypothesized characteristics of the individual components. This saves us a lot time in terms
of ordering the components, assembling, and then testing them. Followed by re-ordering, re-assembling, and re-testing everything
before we have something that is close to a satisfactory system.
My lab focuses a lot of feedback control of bipedal robots. We design our controllers on the basis of mathematical models and then
test the controllers on simulators that typically include terms that are more accurate than what we used in the control design model.
We follow this process because in the end, we obtain a higher performing system (robot plus controller) in less time than if we went
straight to the hardware and started “hacking” a controller together.
85
In summary, we use mathematical models to make predictions about how our robot would behave in an experiment, and we are
motivated to do this because it leads to better results in less time, not to mention, very few things break when running a simulation,
as opposed to when conducting actual experiments. When speaking of bipedal robots, you can imagine the models from Lagrangian
mechanics that we use for the robot. What about the ground? The camera and LiDAR? the IMU?
Models of terrain must account for varying slope and texture, as well as the presence of holes and obstacles. At what frequency do
changes in terrain characteristics occur and how do the characteristics of the terrain on which the robot is walking now depend on the
terrain it has just traversed? Similarly, cameras are affected by the illumination of objects and by dust or smoke in the air. LiDAR
returns are affected by surface texture and distance. The accelerometers in IMUs have biases (they read non-zero acceleration even
when the robot is at rest, and these readings can vary with the temperature of the device). To date, engineers have found it mostly
non-productive–if not absolutely impossible–to develop physics-based models for these effects. Instead, they have turned to descrip-
tions based in “probability.”
Probability has been approached through several lenses over the past few hundred years. In practice, the frequency interpretation is
fairly widely adopted, wherein probabilities are numbers in the interval [0, 1] that reflect the relative frequency of events, such as the
relative occurrences of heads to tails in a coin or the relative frequency that a given pixel in a camera image will correspond to grass
or a building, given that neighboring pixels have been identified as corresponding to a particular class of object. In the formal study
of Statistics and Mathematics, the “frequentist” interpretation of probability has proven inadequate, leading to a formal definition of
a probability space that parallels, in some sense, the formal definitions of fields, vector spaces, and normed spaces that we gave in
earlier chapters. It is unfortunate that an equally careful development of probability theory is beyond the scope of ROB 501. We’ll at
least let you know where we are coming up short!
• Starts with random vectors and moves into Gaussian or Normal random vectors
https://www.probabilitycourse.com/chapter6/6_1_5_random_vectors.php
5.1.3 (Optional Read) Probability Spaces Provide a Means to Formalize the Theory of Probability
This section is meant to justify us taking a simplified (relaxed) approach to probability in ROB 501; basically, to do it right, you need
to take Math 597 at Michigan. If you do not feel any particular need for a justification, then you can skip to the next section.
• Ω is the sample space. Think of it as the set of all possible outcomes of an experiment.
• E ⊂ Ω is an event.
• F is the collection of allowed events1 . It must at least contain ∅ and Ω. It is closed with respect to set complement, countable
unions, and countable intersections2 . Such sets are called sigma algebras https://en.wikipedia.org/wiki/%CE%
A3-algebra.
1 Though it is too deep for ROB 501, there are subsets of the reals that are so complicated one cannot even define a reasonable notion of “probability” that agrees
with how we would want to define a uniform probability on an interval, such as [a, b].
2 By De Morgan’s laws, once a set is closed under set complements and countable unions, it is automatically closed under countable intersections; see https:
//en.wikipedia.org/wiki/De_Morgan.
86
• P : F → [0, 1] is a probability measure. It has to satisfy a few basic operations
Question 5.3 The set of allowed events, F , being the set of all subsets of Ω was already a bit awkward to write down explicitly for
a die, but it was certainly doable. Is it always possible to write down F explicitly? And do we gain that much by doing it?
Once we leave the simple settings of dice, balls in urns, etc., things become a lot more technically challenging. Take for example,
Ω = S 1 , the circle, and instead of rolling a die, we spin a dial and check if it lands in a given subset of S 1 . Let’s identify S 1 with
the interval [0, 2π) ⊂ R and suppose we want to define a “uniform probability measure” on it, by which we mean, if a < b and
[a, b) ⊂ Ω, then P ([a, b)) = b−a
2π . This seems to be a natural continuous extension of a fair (uniform) die, from six possible outcomes
to a continuum of possible outcomes, where the notion of fairness is captured by two sets [a1 , b1 ) ⊂ Ω and [a2 , b2 ) ⊂ Ω having the
same probability whenever, b1 − a1 = b2 − a2 .
What is the set of allowable events F for this “uniform” probability measure? We already know it is “big” because there are an
uncountable number of intervals of the form [a, b) contained in Ω = [0, 2π). In the previous example, F contained all possible
subsets of Ω. However, in the present case, it is impossible to assign a probability to every subset E ⊂ Ω; said another way, there
are subsets of Ω that must be disallowed events. The proper way to define the events F is to use the “Lebesgue measurable sets”
in [0, 2π), but to fully define Lebesgue measure takes at least a week in a course on measure theory https://en.wikipedia.
org/wiki/Lebesgue_measure. Hence, in Engineering and Robotics, we mostly work with probability spaces in one of two
ways
2. taking F as the smallest sigma algebra generated by the half-open intervals [a, b) ∈ Ω, the set we obtain by applying the
operations of set complement and countable unions (called the Borel sigma algebra).
Furthermore, we place ourselves in contexts where we can integrate a density over a set to assign the probability of an event. In the
above case, if E = [0.1, 0.2) ∪ {0.5} ∪ (0.3, π], for example, we would compute the probability as
0.2 0.5 π
π − 0.3 π − 0.2
Z Z Z Z
1 1 1 1 0.1
P (E) = dx = dx + dx + dx = +0+ = ,
2π 0.1 2π 0.5 2π 0.3 2π 2π 2π 2π
A
which seems easy enough. The trick is to only compute probabilities of sets (i.e., events) that are simple enough that a Riemann
integral over the set (or at worst, a Riemann–Stieltjes integral) can be defined, thereby keeping us away from Lebesgue integration.
87
(a) (b)
Figure 5.1: Images of the Cantor set, thanks to Wikipedia. (a) The top line is the interval [0, 1]. The line below is what is left
when one removes the (open) middle third, (1/3, 2/3). The line below that is what is left when the next two (open) middle thirds,
(1/9, 2/9) and (7/9, 8/9), are removed, etc. In the beginning, it’s hard to believe that there is anything left over, but there is. In
fact, the Cantor set can be placed in one to one correspondence with the original interval [0, 1]. (b) Shows various paths traced out
in a ternary (base 3) expansion of numbers in the Cantor set; “each point in the Cantor set is uniquely located by a path through
an infinitely deep binary tree, where the path turns left or right at each level according to which side of a deleted segment the point
lies on. Representing each left turn with 0 and each right turn with 2 yields the ternary [expansion] for a point”, from Wikipedia
https://en.wikipedia.org/wiki/Cantor_set. Replacing the 2’s with 1’s yields a bijection from the Cantor set to the
interval [0, 1], which is cool and surprising, though otherwise irrelevant to ROB 501!
The Cantor set C ⊂ [0, 1] is a famous set that is uncountable and has Lebesgue measure 0,
∞
( )
X ϵi
C = x ∈ [0, 1] x = , ϵi ∈ {0, 2} .
i=1
3i
The classical construction given in https://en.wikipedia.org/wiki/Cantor_set shows that it belongs to the Borel
sigma algebra. It is impossible to define a uniform probability measure on [0, 1] and compute the probability of the Cantor set by
performing a Riemann–Stieltjes integral over C because the “index function”
(
1 x∈C
I(x) =
0 otherwise
In any case, whether you are convinced or not, in the rest of this Chapter, we are obliged to be less careful than we have been in other
parts of the book. When we discuss the probability of an event and compute it via an integral, we will simply assume that the integral
exists within the usual theory of Riemann integration.
agrees with how we would want to define the probability of an interval, such as [a, b].
4 By De Morgan’s laws, once a set is closed under set complements and countable unions, it is automatically closed under countable intersections; see https:
//en.wikipedia.org/wiki/De_Morgan.
88
• P : F → [0, 1] is a probability measure. It has to satisfy a few basic operations
Shortly, we will define a probability density. On the one hand, a density can be viewed as allowing us to define a probability space
on the range R of a random variable X : Ω → R. On the other hand, it can be viewed as replacing all of the confusing probability
space details with integrals over sets. It is fine to use the latter interpretation.
Figure 5.2: (Left) Normal distribution N (µ, σ) with µ = 0 and σ = 30. (Right) How do you determine the density? You have to
collect data! The figure shows a “fit” of a normal distribution to data.
Definition 5.5 For E ⊂ Ω, its set complement in Ω is often denoted in two ways, ∼ E := E c := {x ∈ Ω | x ̸∈ E}. Similarly, for
A ⊂ R, ∼ A := Ac := {x ∈ R | x ̸∈ A}.
Definition 5.6 A function X : Ω → R is a random variable if ∀ x ∈ R, the set {ω ∈ Ω |X(ω) ≤ x} ∈ F , that is P ({ω ∈
Ω |X(ω) ≤ x}) is defined.
With a random variable, we are typically interested in computing the probability that it takes values in a given set A ⊂ R, that is, we
seek to determine P {ω ∈ Ω |X(ω) ∈ A}. With the “frequentist” interpretation, we are asking how “frequently” does X take values
in the set A. This seems quite difficult to compute because we need to compute first, the inverse image of the set A under the function
X : Ω → R,
{ω ∈ Ω |X(ω) ∈ A}, (5.1)
and then secondly, compute the probability of this set using our probability space (Ω, F , P ). The notion of a density allows us to
circumvent the computation of (5.1) and work directly with the set A itself.
R∞
Definition 5.7 A (piecewise continuous5 ) function f : R → [0, ∞) is a probability density if −∞
f (x)dx = 1.0
89
• Laplace density: for b > 0, µ ∈ R, and x ∈ (−∞, ∞),
1 −|x−µ|/b
f (x) = e .
2b
The parameter µ is the mean value, defined below.
1 1 x−µ 2
f (x) = √ e− 2 ( σ ) .
σ 2π
The parameter µ is the mean value and σ is the standard deviation; see below.
The lower bound of −∞ is for convenience. It can be replaced with inf{X(Ω) ⊂ R}.
Notation 5.10 The notation X ∼ f is read as X is distributed with density f or that X is a random variable with density f .
Remark 5.11 Useful shorthand notation {X ≤ x} := {ω ∈ Ω |X(ω) ≤ x} and more generally, for A ⊂ R, we define {X ∈ A} :=
{ω ∈ Ω |X(ω) ∈ A}. □
Remark 5.12 Because F is closed under set complements, (countable) unions, and (countable) intersections, we can assign proba-
bilities to
Remark 5.13 For how general of a set A ⊂ R can we compute P ({X ∈ A})? To understand this, we note that
Z Z ∞
P ({X ∈ A}) := f (x)dx := IA (x)f (x)dx. (5.2)
A −∞
Because we are limiting ourselves to Riemann-Stiljes integrals, we need IA to be sufficiently “nice” that the product IA (x)f (x) has
bounded variation. A sufficient condition is that A can be expressed as a countable disjoint union of half-open intervals. This seems
to be general enough for use in engineering. □
90
5.2.2 Random Vectors and Densities
Let (Ω, F , P ) be a probability space. A function X : Ω → R is called a random vector if each component of
p
Definition
5.15
X1
X2
X = . is a random variable, that is, ∀ 1 ≤ i ≤ p, Xi : Ω → R is a random variable.
..
Xp
Consequently, ∀x ∈ Rp , the set {ω ∈ Ω | X(ω) ≤ x} ∈ F (i.e., it is an allowed event), where the inequality is understood pointwise,
that is,
X1 (ω) x1
X1 (ω) ≤ x1
X2 (ω) x2
X2 (ω) ≤ x2 \ p
{ω ∈ Ω | X(ω) ≤ x} := ω ∈ Ω | . ≤ . := ω ∈ Ω | = {ω ∈ Ω | Xi (ω) ≤ xi }.
.
.. .. ..
i=1
Xp (ω) xp Xp (ω) ≤ xp
Definition 5.16 X : Ω → Rp is a continuous random vector if there exists a density fX : Rp → [0, ∞) such that,
Z xp Z x2 Z x1
∀ x ∈ RP , P {X ≤ x} =
... fX (x̄1 , x̄2 ...x̄p )dx̄1 dx̄2 ...dx̄p .
−∞ −∞ −∞
p
More generally, for all A ⊂ R such that the indicator function IA has bounded variation,
Z ∞ Z ∞Z ∞
P {X ∈ A} = ... IA (x̄1 , x̄2 ...x̄p )fX (x̄1 , x̄2 ...x̄p )dx̄1 dx̄2 ...dx̄p .
−∞ −∞ −∞
Notation 5.17 The notation X ∼ f is read as X is distributed with density f or that X is a random vector with density f .
Definition 5.18 (Moments) Suppose g : Rp → Rk
Z Z ∞ Z ∞
E{g(X)} := g(x)fX (x)dx := ... g(x1 , ..., xp )fX (x1 , ..., xp )dx1 ...dxp
Rp −∞ −∞
91
5.3 Estimators
5.3.1 Best Linear Unbiased Estimator (BLUE)
Model: The measurement is y = Cx + ε, y ∈ Rm , x ∈ Rn , E{ε} = 0, cov{ε, ε} = E{εε⊤ } = Q > 0, and the columns of
C are linearly independent. We assume no stochastic (random) model for the unknown x. In other words, x ∈ Rn is an unknown
deterministic vector. It is emphasized that ε and hence y are random vectors while x is not a random vector. Do not let the lowercase
symbols being used for y, ϵ, and x distract you.
Seek: An estimate x
b of x that satisfies:
b = Ky for some n × m matrix K.
1. Linear: x
2. Unbiased for all x ∈ Rn : E{b
x − x} = 0 holds for all x ∈ Rn . It needs to hold for all x because x can be an arbitrary value in
n
R .
n
x − x)⊤ (b
x − x)} = E{ |xbi − xi |2 }, the variance of x
P
b minimizes E{(b
3. Best: x b − x.
i=1
Proof:
x − x} ∀x ∈ Rn
0 = E{b
⇕
0 = E{Ky − x} ∀x ∈ Rn
⇕
0 = E{K(Cx + ε) − x} ∀x ∈ Rn
⇕
0 = E{(KC − I)x} − E{Kε} ∀x ∈ Rn
⇕
0 = (KC − I)x ∀x ∈ Rn
where we used E{ε} = 0 by assumption and E{(KC − I)x} = (KC − I)x because x is deterministic. Finally, 0 = (KC − I)x ∀x ∈
Rn ⇐⇒ KC − I = 0n×n as can be seen by taking x = ei , for 1 ≤ i ≤ n.
□
Aside: For v, w ∈ Rn , (v + w)⊤ (v + w) = v ⊤ v + w⊤ w + v ⊤ w + w⊤ v = v ⊤ v + w⊤ w + 2v ⊤ w because v ⊤ w is a scalar.
x − x)⊤ (b
E{(b x − x)} = E{(KCx − x + Kε)⊤ (KCx − x + Kε)}
= E{x⊤ (KC − I)⊤ (KC − I)x + 2(Kε)⊤ (KC − I)x + ε⊤ K ⊤ Kε}
= E{2(Kε)⊤ (KC − I)x + ε⊤ K ⊤ Kε}
= E{ε⊤ K ⊤ Kε}
ε⊤ K ⊤ Kε = tr ε⊤ K ⊤ Kε = tr Kεε⊤ K ⊤ .
and therefore,
x − x)⊤ (b
E{(b x − x)} = tr E{Kεε⊤ K ⊤ }
= tr(KQK ⊤ ).
Hence,
x − x)⊤ (b
b = arg min E{(b
K b = arg min tr(KQK ⊤ )
x − x)} ⇐⇒ K
KC=I KC=I
92
A simplifying observation: If we partition K into rows via
k1
k2
K= .
..
kn
k1
.. ⊤
⊤
= k1⊤ kn⊤ kn⊤ = ni=1 ki Qki⊤ . Moreover,
P
then K ··· and tr . Q k1 ···
kn
KC = In×n ⇐⇒ C ⊤ K ⊤ = In×n
⇐⇒ C ⊤ k1⊤ · · · kn⊤ = e1
··· en
⇐⇒ C ⊤ ki⊤ = ei 1 ≤ i ≤ n.
Hence, we have n-separate optimization problems involving the column vectors ki⊤ , namely
which yields h i
b⊤ = b
K k1⊤ kn⊤ = Q−1 C(C ⊤ Q−1 C)−1 .
··· b
Therefore,
b = (C ⊤ Q−1 C)−1 C ⊤ Q−1 .
K
Theorem 5.22 (BLUE) Let x ∈ Rn , y ∈ Rm , y = Cx + ε, E{ε} = 0, E{εε⊤ } =: Q > 0, and rank(C) = n. The Best Linear
Unbiased Estimator (BLUE) is x
b = Ky
b where
b = C ⊤ Q−1 C −1 C ⊤ Q−1 .
K
Moreover, the covariance of the error is
⊤ −1
E{(b x − x) } = C ⊤ Q−1 C
x − x) (b .
Proof: The only thing left to show is the error covariance. From previous calculations,
b − x = Ky − x
x
= KCx + Kε − x
= Kε (because KC = I)
∴ E{(b x − x) } = E{(Kε)(Kε)⊤ }
x − x)(b ⊤
= E{Kεε⊤ K ⊤ }
= KQK ⊤
Hence,
⊤
E{(b x − x) } = KQK ⊤
x − x) (b
−1 −1
= C ⊤ Q−1 C C ⊤ Q−1 QQ−1 C C ⊤ Q−1 C
−1 ⊤ −1 ⊤ −1 −1
= C ⊤ Q−1 C C Q C C Q C
−1
= C ⊤ Q−1 C
.
■
93
Remark 5.23 Comparing Weighted Least Squares to BLUE:
• They are identical when the weighting matrix is taken as the inverse of the covariance matrix of the noise term: W = Q−1 .
The inverse of the covariance matrix is called the information matrix. Hence, there is low information when the variance (or
covariance) is large.
• Another way to say this, if you solve a least squares problem with weight matrix W > 0, you are implicitly assuming that your
uncertainty in the measurements has zero mean and a covariance matrix of Q = W −1 .
• If you know the uncertainty has zero mean and a covariance matrix of Q, using W = Q−1 makes a lot of sense. For simplicity,
assume that Q is diagonal. A large entry in Q means high variance, which means the measurement is highly uncertain.
Hence, the corresponding component of y should not be weighted very much in the optimization problem....and indeed, taking
W = Q−1 does just that because, the weight term W is small for large terms in Q.
• Weighted least squares was based on overdetermined systems of linear equations. To derive BLUE, we needed to understand
underdetermined systems of linear equations. That’s kind of cool!
Remark 5.24 E{εx⊤ } = 0 means that the state variables and noise terms are uncorrelated. Recall that uncorrelated does NOT
imply independence, except for Gaussian random vectors. □
Assumption: Q ≥ 0, P ≥ 0, and CP C ⊤ + Q > 0. (Will see why later.) We note that Q > 0 =⇒ CP C ⊤ + Q > 0 for all m × n
matrices C. If Q > 0, we do not need to assume the columns of C are linearly independent. In fact, C = 0m×n is possible.
Objective: We seek x
b that minimizes the variance
n
X n
X
x − x)⊤ (b
E{(b xi − xi ) 2 } =
x − x)} = E{ (b xi − xi )2 }.
E{(b
i=1 i=1
Remark 5.25 As for BLUE, we see that there are n separate optimization problems. We also see that a linear estimate x
b = Ky
would be automatically unbiased, because
Problem Formulation (the Unexpected Inner Product): We will pose this as a minimum norm problem in an inner product space
of random variables. Suppose that
x1 ε1
.. ..
x = . and ε = . .
xn εm
We recall that components of random vectors are random variables. Hence, xi , 1 ≤ i ≤ n and εj , 1 ≤ j ≤ m are all random
variables, and hence are functions. We define F = R, and X = span{x1 , x2 , . . . , xn , ε1 , ε2 , . . . , εm }, and for z1 , z2 ∈ X , we
define their inner product by
< z1 , z2 >:= E{z1 z2 }.
We note that ∀ z ∈ X , E{z} = 0 and var(z) =< z, z >. Hence, this inner product space is designed to treat minimum variance
optimization problems.
94
Remark 5.26
Pij z1 = xi , z2 = xj
Q
ij z1 = εi , z2 = εj
E{z1 z2 } =
0 z1 = xi , z2 = εj
0 z1 = εi , z2 = xj .
□
Define:
M = span{y1 , y2 , . . . , ym } ⊂ X (M is the subspace spanned by the measurements),
n
P
yi = Ci x + εi = Cij xj + εi , 1 ≤ i ≤ m, (i-th row of y)
j=1
Fact: {y1 , y2 , . . . , ym } is linearly independent if, and only if, CP C ⊤ +Q is positive definite. This is proven below when we compute
the Gram matrix. (Recall, {y1 , y2 , . . . , ym } linearly independent if, and only if G is full rank, where Gij :=< yi , yj > .)
G⊤ α
b=β
⇕
⊤
[CP C + Q]b
α = CPi
⇕
b = [CP C ⊤ + Q]−1 CPi
α
95
x
bi = α b 2 y2 + · · · + α
b 1 y1 + α b⊤ y = (row vector × column vector), where
bm ym = α
α
b1
..
α
b = . .
α
bm
Ki⊤ = α
bi = [CP C ⊤ + Q]−1 CPi , 1 ≤ i ≤ n
⇕
Kn = [CP C ⊤ + Q]−1 CP
⊤ ⊤
K1 ···
⇕
⊤
K = [CP C ⊤ + Q]−1 CP
⇕
K = P C ⊤ [CP C ⊤ + Q]−1
b = Ky = P C ⊤ [CP C ⊤ + Q]−1 y
x
Theorem 5.27 (Minimum Variance Estimator) Let x ∈ Rn , y ∈ Rm , y = Cx + ε, with the following stochastic assumptions:
• Zero Means: E{x} = 0, E{ε} = 0.
• Covariances: E{εε⊤ } = Q, E{xx⊤ } = P, E{εx⊤ } = 0.
b = P C ⊤ [CP C ⊤ + Q]−1 .
K
Proof: The only thing left to show is the error covariance. From previous calculations, let’s note that
and thus
x − x)⊤ = (KC − I)xx⊤ (KC − I)⊤ + Kεε⊤ K ⊤ − (KC − I)xε⊤ K ⊤ − Kεx⊤ (KC − I)⊤ .
x − x)(b
(b
96
Remark 5.28
P C⊤
x x ⊤ P
x⊤ C ⊤ + ε⊤ } =
1. cov( ) = E{ x .
y Cx + ε CP CP C ⊤ + Q
x
b − x is the Schur Complement of cov(x) in cov(
2. Hence, we see that the covariance of the error x ).
y
3. The term P C ⊤ [CP C ⊤ + Q]−1 CP represents the “value” of the measurements. It is the reduction in the covariance of x given
the measurement y.
4. If Q > 0 and P > 0, then from the Matrix Inversion Lemma
• P −1 = 0 roughly means P = ∞I, that is infinite covariance in x, which in turn means we have no idea about how x is
distributed.
• For MVE to exist, we can have dim(y) < dim(x) as long as (CP C ⊤ + Q) > 0.
□
Remark 5.29 (Solution to MIL) We will show that if Q > 0 and P > 0, then
P C ⊤ [CP C ⊤ + Q]−1 = [C ⊤ Q−1 C + P −1 ]−1 C ⊤ Q−1 .
Recall the MIL: Suppose that A, B, C and D are compatible6 matrices. If A, C, and (C −1 +DA−1 B) are each square and invertible,
then A + BCD is invertible and
(A + BCD)−1 = A−1 − A−1 B(C −1 + DA−1 B)−1 DA−1 .
We apply the MIL to [C ⊤ Q−1 C + P −1 ]−1 , where we identify A = P −1 , B = C ⊤ , C = Q−1 , D = C. This yields
[C ⊤ Q−1 C + P −1 ]−1 = P − P C ⊤ [Q + CP C ⊤ ]−1 CP.
Hence,
[C ⊤ Q−1 C + P −1 ]−1 C ⊤ Q−1 = P C ⊤ Q−1 − P C ⊤ [Q + CP C ⊤ ]−1 CP C ⊤ Q−1
= P C ⊤ I − [Q + CP C ⊤ ]−1 CP C ⊤ Q−1
= P C ⊤ [Q + CP C ⊤ ]−1 Q + CP C ⊤ − CP C ⊤ Q−1
97
5.4 Second Pass on Probability Basics
5.4.1 Marginal Densities, Independence, and Correlation
Suppose the random vector X : Ω → Rp is partitioned into two components X1 : Ω → Rn and X2 : Ω → Rm , with p = n + m, so
that,
X1
X=
X2
fX (x) = f" X # (x
1 , x2 ) = fX1 X2 (x1 , x2 )
1
X2
and it is called the joint density of X1 and X2 . As before, we can define the mean and covariance.
µ1 X1 E{X1 }
• Mean is µ = = E{X} = E{ }=
µ2 X2 E{X2 }
• Covariance is
⊤
Σ11 Σ12 X1 − µ1 X1 − µ1
Σ= = E{ }
Σ21 Σ22 X2 − µ2 X2 − µ2
X1 − µ1
(X1 − µ1 )⊤ (X2 − µ2 )⊤ }
= E{
X2 − µ2
(X1 − µ1 )(X1 − µ1 )⊤ (X1 − µ1 )(X2 − µ2 )⊤
= E{
(X1 − µ1 )(X2 − µ2 )⊤ (X2 − µ1 )(X2 − µ2 )⊤
where Σ12 = Σ⊤ ⊤
21 = cov(X1 , X2 ) = E{(X1 − µ1 )(X2 − µ2 ) } is also called the correlation of X1 and X2 .
X1
If X = : Ω → Rn+m is a continuous random vector, then its components
X2
X1 : Ω → Rn and X2 : Ω → Rm
are also continuous random vectors and have densities, fX1 (x1 ) and fX2 (x2 ). These densities are given a special name.
Definition 5.31 fX1 (x1 ) and fX2 (x2 ) are called the marginal densities of X1 and X2 .
Z ∞
fX2 (x2 ) := fX1 X2 (x1 , x2 )dx1
−∞
Z ∞ Z ∞
:= ··· fX1 X2 x̄1 , . . . , x̄n , x̄n+1 , · · · , x̄n+m dx̄1 · · · dx̄n
−∞ −∞ | {z } | {z } | {z }
x1 x2 dx1
For Normal Random Vectors, however, we can read the marginal densities directly from the joint density! We will not be doing any
iterated integrals in ROB 501.
□
98
Definition 5.33 Random vectors X1 and X2 are independent if their joint density factors
X1 and X2 are uncorrelated if their “cross covariance” or “correlation ” is zero, that is,
Fact 5.34 If random vectors X1 and X2 are independent, then they are also uncorrelated. The converse is in general false. □
6.66 10−4
T
P (A B) P (A)
P (A | B) = = = = 0.266.
P (B) P (B) 2.5 10−3
We note that P (A | B) = 0.266 is a lot more certain than P (A) ≈ 6.66 10−4 . Hence, as you can now imagine, conditional
probabilities are very important in engineering. □
fX1 X2 (x1 , x2 )
fX1 |X2 (x1 | x2 ) := .
fX2 (x2 )
Sometimes, for brevity, we simply write f (x1 | x2 ).
µX1 |X2 =x2 is a function of x2 . Think of it as a function of the value read by your sensor.
99
• Conditional Covariance:
ΣX1 |X2 =x2 := E{(X1 − µX1 |X2 =x2 )(X1 − µX1 |X2 =x2 )⊤ | X2 = x2 }
Z ∞
:= (X1 − µX1 |X2 =x2 )(X1 − µX1 |X2 =x2 )⊤ fX1 |X2 (x1 | x2 )dx1
−∞
ΣX1 |X2 =x2 is a function of x2 . Think of it as a function of the value read by your sensor.
□
5.4.3 (Optional Read) Derivation of the Conditional Density Formula from the Conditional Distribu-
tion:
Definition 5.40 Let X : Ω → Rp be a random vector. Then F : Rp → [0, 1] is a cumulative probability distribution function if
∀ x ∈ Rp , F (x) = P ({X ≤ x}).
Rx
Remark 5.41 If X ∼ f , then the cumulative distribution function and the density are related by F (x) = −∞ f (x)dx and f (x) =
∂
∂x F (x). By the definition of a density, the integral is well defined. □
We define A := {X1 ≤ x1 } and for ϵ > 0, define Bϵ := {x2 − ϵ ≤ X2 ≤ x2 + ϵ} where x2 ± ϵ means adding or subtracting ϵ to
each component of x2 . Then,
\ Z x1 Z x2 +ϵ
P (A Bϵ ) = fX1 X2 (x̄1 , x̄2 )dx̄2 dx̄1
−∞ x2 −ϵ
Z x2 +ϵ
P (Bϵ ) = fX2 (x̄2 )dx̄2
x2 −ϵ
Differentiating the distribution function with respect to x1 gives the density and hence
fX1 X2 (x1 , x2 )
(X1 | X2 = x2 ) ∼ fX1 |X2 (x1 | x2 ) = .
fX2 (x2 )
Alternatively, one can differentiate with respect to x1 first, and then take the limit as ϵ → 0,
R x2 +ϵ
fX1 X2 (x1 , x̄2 )dx̄2 fX1 X2 (x1 , x2 ) · 2ϵ fX1 X2 (x1 , x2 )
fX1 |X2 (x1 | x2 ) = lim x2R−ϵx2 +ϵ = lim = .
ϵ→0 fX2 (x̄2 )dx̄2 ϵ→0 f X2 (x 2 ) · 2ϵ fX2 (x2 )
x2 −ϵ
100
The standard deviation is σ > 0. The mean and variance satisfy
Z Z ∞
µ := E{X} := xfX (x)dx := xfX (x)dx
R −∞
Z Z ∞
2 2 2
σ := E{(X − µ) } := (x − µ) fX (x)dx := (x − µ)2 fX (x)dx.
R −∞
You should be quite familiar with the “bell curve”. X is also called a Gaussian random variable. We also say X has a univariate
normal distribution or a univariate Gaussian distribution to emphasize that we are talking about a single random variable.
For the most part, we do not care too much about individual random variables. We are interested in collections of random variables
and random vectors, and hence we are primarily concerned about jointly distributed random variables. If you take EECS 501, you
can learn a tremendous amount of material about this subject. In the following, you will see a bare bones accounting of multivariate
normal random variables.
Definition 5.43 A finite collection of random variables X1 , X2 , · · · , Xp , or equivalently, the random vector
X1
X2
X= .
..
Xp
has a (non-degenerate) multivariate normal distribution with mean µ and covariance Σ > 0 if the joint density is given by
1 1 ⊤
Σ−1 (x−µ)
fX (x) = p e− 2 (x−µ) .
(2π)p |Σ|
In the above, |Σ| = det(Σ), which must be non-zero for the denominator to be well defined. This condition is what is meant by
“non-degenerate”. When |Σ| = 0, one can still define a multivariate normal distribution, but the “moment generating function” must
be used. This is a technicality that we will skip. We note that
µ := E{X} ∈ Rp
Z Z ∞ Z ∞
µi := xi fX (x)dx := ··· xi fX (x1 , · · · , xp )dx1 · · · dxp
Rp −∞ −∞
Σ := cov(X, X) := E{(X − µ)(X − µ)⊤ } ∈ Rp×p
Z
Σij := E{(Xi − µi )(Xj − µj )} := (xi − µi )(xj − µj ) fX (x)dx
p
R
x1
x2
x = (x1 , x2 , · · · , xp ) or x = . (depending on context).
..
xp
Fact 5.44 (Marginal Densities/Distributions) Each random variable Xi has a univariate normal distribution with mean µi and
variance Σii ,
1 (x −µ )2
− i2Σ i
fXi (xi ) = √ e ii .
2πΣii
Hence, no iterated integrals are required to compute the marginal densities. Why is this true? Because the univariate density depends
on two terms, namely, µi := E{Xi } and σi2 := E{(Xi − µi )2 } = Σii . □
We note the unfortunate lack of coordination in the notation: the standard deviation of Xi , which we typically denote by σi , is given
by p
σi = Σii .
I guess we will not be denoting the entries of Σ with lower case σ.
101
Fact 5.45 (Independence) Gaussian random variables are very special in that they are independent if, and only if, they are uncorre-
lated. Hence, Xi and Xj are independent if, and only if, Σij = Σji = 0.
□
Fact 5.46 (Linear Combinations) Define a new random vector by Y = AX + b, with the rows of A linearly independent. Then Y
is a Gaussian (normal) random vector with
E{Y } = Aµ + b =: µY
cov(Y, Y ) = E{(Y − µY )(Y − µY )⊤ } = AΣA⊤ =: ΣY Y .
and AΣA⊤ > 0 when A has full row rank and Σ > 0. □
Remark 5.47 Taking b = 0 and A to be a row vector with all zeros except a one in the i-th spot, that is A = [0, · · · , 1, · · · , 0],
recovers the marginal densities discussed above. □
In addition to looking at individual random variables making up a random vector, we can group the components to form two or more
blocks of vectors as long as their sizes add up to p, the number of components in X. We abuse notation and write
X1 ∈ Rn
X=
X2 ∈ Rm
In books, you’ll often see the blocks expressed in bold font, such as X1 and X2 . We will NOT do this. Conformally with the partition
of X into two blocks, we partition the mean and covariance as follows
µ
µ =: 1
µ2
Σ11 Σ12
Σ =: .
Σ21 Σ22
From our results on the Schur complement, we know that Σ > 0 if, and only if, Σ22 > 0 and Σ11 − Σ12 Σ−1
22 Σ21 > 0.
Remark 5.49 To be extra clear on the dimensions, we suppose n + m = p and note that
µ1 = E{X1 } ∈ Rn
µ2 = E{X2 } ∈ Rm
Σ11 = cov(X1 , X1 ) ∈ Rn×n
Σ22 = cov(X2 , X2 ) ∈ Rm×m
Σ12 = cov(X1 , X2 ) ∈ Rn×m
Σ21 = cov(X2 , X1 ) ∈ Rm×n .
Σ⊤ ⊤ ⊤
11 = Σ11 , Σ22 = Σ22 , and Σ12 = Σ21 .
Fact 5.50 Each vector Xi has a multivariate normal distribution with mean µi and covariance Σii . This is also called the marginal
distribution of Xi . If we know the mean and covariance for the composite vector X, it is very easy to read off the marginal
distributions of its vector components. This can be established by Fact 5.46 after writing X1 = A1 X + b1 with b1 = 0n×1 and
A1 = [In×n , 0n×m ] and X2 = A2 X + b2 with b2 = 0m×1 and A2 = [0m×n , Im×m ]. □
102
Key Fact 1 (Conditional Distributions of Gaussian Random Vectors:) Let X1 and X2 be as above, namely they are components
of a larger vector X that has a multivariate normal distribution. Then the conditional distribution of X1 given X2 = x2 has a
multivariate normal distribution with
Mean : µ1|2 := µ1 + Σ12 Σ−122 (x2 − µ2 )
and
Covariance: Σ1|2 := Σ11 − Σ12 Σ−1
22 Σ21 .
In passing, we note that the mean depends on the value of x2 while the covariance does not. □
• µ1|2 = E{X1 | X2 = x2 }
• X1 given X2 = x2 is a random vector. It has a multivariate normal distribution with the above mean vector and covariance
matrix. Specifically, its density is
A proof of this can be found at the link below. The algebra is rather painful. If you are very ambitious, you can work out the spe-
cial case where X1 and X2 are scalars. This will not be on any exam in ROB 501; see http://fourier.eng.hmc.edu/e161/
lectures/gaussianprocess/node7.html; see also http://www.stats.ox.ac.uk/~steffen/teaching/bs2HT9/
gauss.pdf.
Remark 5.51 If X1 and X2 are uncorrelated, then µ1|2 = µ1 and Σ1|2 = Σ11 . Similarly, let’s suppose that Σ22 is large, specifically,
Σ22 = ρIm×m . Then, limρ→∞ µ1|2 = µ1 and limρ→∞ Σ1|2 = Σ11 . The term Σ12 Σ−1 22 Σ21 measures the value of the “information
gained by conditioning on X2 ”. As Σ−1
22 → 0, the value of the added information tends to zero.
□
Key Fact 2 (Conditional Independence) Suppose we have 3 vectors X1 , X2 and X3 that are jointly normally distributed:
X1
X = X2
X3
and that X1 and X3 are each independent of X2 . We then have no special structure on the means,
µ1
µ = µ2
µ3
103
We compute the covariance of X1 and X2 conditioned on X3 , that is
X1
X3 ,
X2
using the Schur complement from Key Fact 1
X1|X3 X1|X3 Σ11 0 Σ13
Σ−1
⊤
cov( , )= − 33 Σ13 0
X2|X3 X2|X3 0 Σ22 0
Σ11 − Σ13 Σ−1 ⊤
= 33 Σ13 0
0 Σ22
Because the off-diagonal blocks are zero, the two random variables X1|X3 and X2|X3 are uncorrelated, and because they are normal,
we conclude they are independent.
Once again, what we have seen is that if X1 and X2 are independent, and we also have X2 is independent of X3 , then X1 and
X2 remain independent when we condition them on X3 .
Key Fact 3 (Covariance of a Sum of Independent Normal Random Variables) Let X1 and X2 be independent normal random vectors,
with means µ1 and µ2 , and covariances, Σ11 and Σ22 . Define Y as a “linear combination” of X1 and X2 via
Y = AX1 + BX2
for appropriately sized matrices A and B. Then
µY = Aµ1 + Bµ2
and
cov(Y, Y ) = AΣ11 A⊤ + BΣ22 B ⊤ .
□
To see why this is true, we first note that
(Y − µY )(Y − µY )⊤ = A(X1 − µ1 )(X1 − µ1 )⊤ A⊤ + B(X2 − µ2 )(X2 − µ2 )⊤ B ⊤
+ A(X1 − µ1 )(X2 − µ2 )⊤ B ⊤ + B(X2 − µ2 )(X1 − µ1 )⊤ A⊤ ,
and then note that when expectations are taken on each side, the independence of X1 and X2 gives
E{(X1 − µ1 )(X2 − µ2 )⊤ } = 0 and E{(X2 − µ2 )(X2 − µ2 )⊤ } = 0.
Therefore,
cov(Y, Y ) = E{(Y − µY )(Y − µY )⊤ }
= AE{(X1 − µ1 )(X1 − µ1 )⊤ }A⊤ + BE{(X2 − µ2 )(X2 − µ2 )⊤ }B ⊤
= AΣ11 A⊤ + BΣ22 B ⊤ .
Key Fact 4 Suppose that X,
Y and
Z are jointly distributed random vectors with density fXY Z . Then the conditional random
Y
vectors (X|Z) (Y |Z) and X have the same conditional densities, that is,
Z
f(X|Z)(Y |Z) fXY Z Y
(X|Z) (Y |Z) ∼ = ∼X .
f(Y |Z) fY Z Z
□
The above fact does not require the random vectors to be jointly normal. The proof goes like this,
f" X #
Z fXY Z
f(X|Z)(Y |Z) Y fZ fXY Z Y
(X|Z) (Y |Z) ∼ = = fY Z
= ∼X .
f(Y |Z) fY |Z fY Z Z
fZ
In the Kalman Filter, the conditional density on the left will give us a recursive implementation of the density on the right, in place
of a batch update. You have to see it in action to believe it.
104
5.7 Discrete-time Kalman Filter
5.7.1 Model and Assumptions
Model: Linear time-varying discrete-time system with “white7 ” Gaussian noise
x ∈ Rn , w ∈ Rp , y ∈ Rm , v ∈ Rm . Moreover, the random vectors x0 , and, for k ≥ 0, wk , vk are all independent8 Gaussian (normal)
random vectors.
Precise assumptions on the random vectors We’ll denote δkl = 1 ⇐⇒ k = l and δkl = 0, k ̸= l.
The proof is by induction using the recursive nature of the discrete-time model. We skip it. The reader can easily fill it in.
In the next subsection, we give (one form of) the discrete-time Kalman Filter. After that, we provide the main elements of its
derivation. There are many variations of the basic filter, all equivalent to the one we give, but some preferable over others for
numerical reasons. Chapter 5.7.5 provides a version of the filter with the measurement update and prediction steps combined.
7 Recall that in white light, all frequencies are present. When only certain frequency components are present, you get “colored” light, such as blue light or red light.
The term “white” noise means that if you compute the power spectral density of the noise random process, it is a constant, meaning that all frequency components are
equally represented, just as in white light.
8 Recall that for normal random variables, uncorrelated and independent are the same thing. This is one of several special properties of Gaussian random variables.
105
5.7.2 Basic Kalman Filter
Definition of Terms:
Initial Conditions:
b0|−1 := x̄0 = E{x0 }, and P0|−1 := P0 = cov(x0 )
x
For k ≥ 0
x
bk+1|k = Ak x
bk|k
Pk+1|k = Ak Pk|k A⊤ ⊤
k + Gk Rk Gk
End of For Loop (Just stated this way to emphasize the recursive nature of the filter)
We could apply the MVE from Theorem 5.27 and do a batch computation. This would rapidly become inefficient as the number of
measurements grows over time. What we seek instead is a recursive form of minimum variance estimation, just as we created RLS,
a recursive version of weighted least squares; see Proposition 3.91 and Proposition 3.92. The main difference here is that we seek to
estimate more than the initial condition x0 . In fact, we seek to estimate xk for k ≥ 0.
Definition 5.53 (Measurements in the KF) We collect all of the measurements up to and including time k
Yk = (yk , yk−1 , · · · , y0 ).
Strictly speaking, we should be stacking them up into a column vector as we have done for all of our estimation problems, but
notationally, it is more convenient to write them in a row. Also, it is more convenient to put the most recent measurement at the head
of the list. We note that Yk = (yk , Yk−1 ).
106
Hence,
5.7.4 Filter Derivation Using Induction and Properties of Conditional Distributions of Gaussian Ran-
dom Vectors
Base step: The initial conditions of the filter at time k = 0, namely
x
b0|−1 := x̄0 , and P0|−1 := P0
Key idea of the development: We need to compute the distribution (or density) of the conditional random vector
From Key Fact 4, we have that X|(Y, Z) = X|Z Y |Z. From this we obtain
then we can apply Key Fact 1 to obtain (5.3). The following calculations are aimed at doing just this.
Measurement Update: We seek to derive the filter equations in Chapter 5.7.2. To begin, we have that yk = Ck xk + vk . It follows
by linearity that the conditional random variable yk |Yk−1 is equal to
By the assumptions on the noise model, vk is independent of both xk and Yk−1 , and hence by Key Fact 2, the conditional random
vector vk |Yk−1 is independent of the conditional random vector xk |Yk−1 . Moreover, because vk is independent of Yk−1 , we deduce
that vk |Yk−1 = vk . Putting this together, we have that
yk |Yk−1 = Ck xk |Yk−1 + vk ,
107
and xk |Yk−1 and vk are independent. Hence
ybk|k−1 :=E{yk |Yk−1 }
=E{Ck xk |Yk−1 } + E{vk }
=Ck E{xk |Yk−1 } + E{vk }
=Ck x
bk|k−1 + 0
=Ck x
bk|k−1 .
Moreover, the independence of xk |Yk−1 and vk with Key Fact 3 yields
cov(yk |Yk−1 , yk |Yk−1 ) = Ck Pk|k−1 Ck⊤ + Qk .
We use independence again to obtain
cov(xk |Yk−1 , yk |Yk−1 ) = cov(xk |Yk−1 , Ck xk |Yk−1 ) = Pk|k−1 Ck⊤ .
With this information, we conclude that the vector
xk
|Yk−1
yk
is jointly normally distributed, with mean and covariance
Pk|k−1 Ck⊤
x
bk|k−1 Pk|k−1
, (5.4)
Ck x
bk|k−1 Ck Pk|k−1 Ck Pk|k−1 Ck⊤ + Qk
As discussed above, to compute the distribution of (xk |Yk ), we have from Key Fact 4
and thus applying Key Fact 1 to (5.4) we compute the mean and covariance of xk |Yk = xk |Yk−1 yk |Yk−1 to be
−1
bk|k−1 + Pk|k−1 Ck⊤ Ck Pk|k−1 Ck⊤ + Qk
x
bk|k = x yk − Ck x
bk|k−1
−1
Pk|k = Pk|k−1 − Pk|k−1 Ck⊤ Ck Pk|k−1 Ck⊤ + Qk
Ck Pk|k−1 .
Remark 5.55 We note that Pk|k is the Schur complement Ck Pk|k−1 Ck⊤ + Qk in the covariance of
xk
|Yk−1
yk
Prediction or Time Update: We seek to derive the remaining filter equations in Chapter 5.7.2. This time we use the state-variable
model instead of the output model, namely
xk+1 = Ak xk + Gk wk ,
and we are interested in the conditional random vector
xk+1 |Yk = Ak xk |Yk + Gk wk |Yk .
Because xk and Yk are both independent of wk , by Key Fact 2, xk |Yk and wk |Yk are also independent. It follows that
bk+1|k = E{xk+1 |Yk }
x
= E{Ak xk + Gk wk |Yk }
= Ak E{xk |Yk } + Gk E{wk |Yk }
bk|k + Gk E{wk }
= Ak x
= Ak x
bk|k ,
where we have used wk |Yk = wk , and E{wk } = 0. Next, we use Key Fact 3 and the conditional independence of the random vectors
xk |Yk and wk |Yk to evaluate the covariance of xk+1 |Yk as
Pk+1|k = Ak Pk|k A⊤ ⊤
k + Gk Rk Gk .
108
That’s the Proof Folks! The famous Kalman Filter can be derived using four Key
Facts, where three of them depend on properties of conditional Gaussian random
vectors and one is true for general conditional random vectors that have densities.
xk+1 = Ak xk + Bk uk + Gk wk
yk = Ck xk + vk ,
with the assumptions on the random vectors x0 , wk and vk the same as before.
Combined Filter: The measurement-update step and time-update step of the Kalman Filter can be combined into a single step. The
algorithm becomes:
Initial Conditions:
b0|−1 := x̄0 = E{x0 }, and P0|−1 := P0 = cov(x0 )
x
For k ≥ 0
−1
Kk = (Pk|k−1 Ck⊤ ) Ck Pk|k−1 Ck⊤ + Qk
x bk|k−1 + Bk uk + Ak Kk yk − Ck x
bk+1|k = Ak x bk|k−1
Remark 5.56 You do not have to start at k = 0. In MATLAB, it is often easier to begin with k = 1. In that case, the initial conditions
are
xb1|0 := x̄0 = E{x0 }, and P1|0 := P0 = cov(x0 ).
−1
Remark 5.57 Kk Ck Pk|k−1 = (Pk|k−1 Ck⊤ ) Ck Pk|k−1 Ck⊤ + Qk
Ck Pk|k−1 is symmetric positive semi-definite and represents
the value of the measurement in reducing the covariance of the state estimate, just as in the MVE.
For a nonlinear system, the EKF will not be an exact computation of the conditional mean of xk given all of the measurements up to
time k, but rather, an approximate calculation of x
bk|k .
There are several other versions of “Extended Kalman Filters”, many of which are treated here https://en.wikipedia.
org/wiki/Kalman_filter. We particularly recommend you read about the unscented Kalman filter, which has a powerful
means for approximating the effect on nonlinear functions on Gaussian noise, https://en.wikipedia.org/wiki/Kalman_
filter#Unscented_Kalman_filter.
109
Definition of Terms: (Changes for EKF given in color)
Initial Conditions:
b0|−1 := x̄0 = E{x0 }, and P0|−1 := P0 = cov(x0 )
x
For k ≥ 0
∂h(x)
Ck :=
∂x x
bk|k−1
−1
Kk = Pk|k−1 Ck⊤ Ck Pk|k−1 Ck⊤ + Qk ( Kalman Gain)
x bk|k−1 + Kk yk − h(b
bk|k = x xk|k−1 )
Pk|k = Pk|k−1 − Kk Ck Pk|k−1
x
bk+1|k = f (b
xk|k , uk ) (Use the nonlinear model to predict the next state)
∂f (x, uk )
Ak := (Partial with respect to x only)
∂x x
bk|k
Pk+1|k = Ak Pk|k A⊤ ⊤
k + Gk Rk Gk
Yes, the EFK above is pretty much line for line the same as the standard KF. Recall our remarks at the beginning on alternative forms
of the EKF.
Remark 5.58 The EKF given above has been analyzed as a deterministic observer for a nonlinear discrete-time system. More
information can be found in the journal paper http://ece.umich.edu/faculty/grizzle/papers/ekf.pdf or the
conference version via DOI 10.23919/ACC.1992.4792775
xk+1 = Axk
yk = Cxk
110
Question 1: (Observability) When can we reconstruct the initial condition (x0 ) from the measurements y0 , y1 , y2 , . . . ?
y0 = Cx0
y1 = Cx1 = CAx0
y2 = Cx2 = CAx1 = CA2 x0
..
.
yk = CAk x0
Rewriting the above in matrix form yields,
y0 C
y1 CA
= x0
.. ..
. .
yk CAk
C
CA
We note that if rank . = n, then the null space consists of the zero vector and we can determine x0 uniquely on the basis
..
CAk
of the measurements.
C C
CA CA
The Caley Hamilton Theorem proves that rank . = rank for all k ≥ (n − 1).
..
.. .
CAk CAn − 1
C
CA
Fact 5.59 (Kalman observability rank condition.) We can determine x0 uniquely from the measurements if, and only if, rank =
..
.
CAn−1
n.
Question 2:(Full-State Luenberger Observer) Can we process the measurements dynamically (i.e. recursively) to “estimate” xk ?
Define
x̂k+1 = Ax̂k + L(yk − C x̂k ),
where the matrix L is to be chosen, and define the error to be ek := xk − x̂k . We seek conditions so that ek → 0 as k → ∞, because,
if ek → 0 because then x̂k → xk .
ek+1 = (A − LC)ek
Fact 5.60 Let e0 ∈ Rn and define ek+1 = (A − LC)ek . The the sequence ek → 0 as k → ∞ for all e0 ∈ Rn if, and only if,
|λi (A − LC)| < 1 for i = 1, . . . , n.
Fact 5.61 A sufficient condition for the existence of L : Rp → Rn that places eigenvalues of (A − LC) in the unit circle is:
C
CA
rank = n = dim(x)
..
.
CAn−1
.
111
Remark 5.62 L is called the Luenberger gain. When the models in the the Kalman Filter for the state variables and the noise terms
are time invariant, and the observability condition discussed above is met, the Kalman gain will also converge to a constant vector
or matrix, Kss , called the steady-state Kalman gain.
1. Reason to choose one gain over the other: Optimality of the estimate when you know the noise statistics.
2. Kalman Filter works for time-varying models Ak , Ck , Gk , etc.
(See http://en.wikipedia.org/wiki/Matrix_inversion_lemma#Blockwise_inversion)
and
µ1|2 = Λ−1
1|2 η1|2
□
112
Remark 5.66 (Marginal Distributions Using the Information Matrix) Getting the marginal distributions from the information form
of the distribution is more complicated. If you are interested, you can easily find it on the web or in most graduate level probability
texts. □
2
Given Fact: From (*), we have x̂ = Ky minimizes E{||x̂ − x||2 } if, and only if, for 1 ≤ i ≤ n, x̂i = ki y minimizes E{(x̂i − xi ) },
where
k1
k2
K= .
..
kn
Hints: (1) If M is 1 × n so that z = M x is a scalar, then z 2 = M xx⊤ M ⊤ ; (2) xi = e⊤
i x, where [e1 | e2 | · · · | en ] = In×n . (Yes,
[ei ]j = 1 ⇐⇒ i = j, and zero otherwise.)
Your work starts here. Everything above is assumed given and true. Determine which of the following statements are TRUE. At
least one answer is FALSE and at least one answer is TRUE.
b = Ky is unbiased 9 if, and only if, KC = I.
(a) x
(b) Let x
bi = ki y. Then
⊤
C ⊤ ki⊤ − ei C ⊤ ki⊤ − ei
2 P 0
xi − xi ) } =
E{(b
ki⊤ 0 Q ki⊤
where [e1 | e2 | · · · | en ] = In×n .
2
(c) Let x xi − xi ) } = ki (CP C ⊤ + Q)ki⊤ .
bi = ki y. Then E{(b
2
ki := arg min E{(ki (Cx + ϵ) − xi ) } denote the gain that minimizes the variance for a linear estimator x
(d) Let b bi = ki y. Then
ki can be obtained by minimizing the error of the over determined equation
b
Aki⊤ = b,
with
C⊤
ei
A= , b= ,
Im×m 0m×1
α
and ei is defined as in part (b); the norm on ∈ Rn+m is given by
β
⊤
α 2 α P 0 α
|| || = .
β β 0 Q β
9 Recall that an estimator is said to be unbiased when E{b
x} = E{x}
113
Solution: The answers are (b) and (d). We work the problem completely, then answer the questions.
x} = KCx, which
because E{x} = 0 and E{ϵ} = 0. In BLUE, x was deterministic, and thus E{x} = x instead of zero gave us E{b
was the source of the constraint, KC = I. In MVE, we do not have this restriction.
2 2
xi − xi ) = (ki [Cx + ϵ] − xi )
(b
2
= ki [Cx + ϵ] − e⊤
i x
2
= [ki C − e⊤
i ]x + ki ϵ
⊤
= [ki C − e⊤ [ki C − e⊤
i ]x + ki ϵ i ]x + ki ϵ
= [ki C − e⊤ ⊤ ⊤ ⊤ ⊤ ⊤ ⊤ ⊤ ⊤
i ]xx [C ki − ei ] + 2[ki C − ei ]xϵ ki + ki ϵϵ ki
where we have used the fact that a scalar squared is equal to the scalar times its transpose.
We now use the given statistical data to conclude that
2
xi − xi ) } = [C ⊤ ki⊤ − ei ]⊤ P [C ⊤ ki⊤ − ei ] + ki Qki⊤
E{(b
⊤ ⊤ ⊤ ⊤ ⊤
C ki − ei P 0 C ki − ei
= .
ki⊤ 0 Q ki⊤
It follows that
2 2
xi − xi ) } = arg min E{(ki (Cx + ϵ) − xi ) },
ki = arg min E{(b
b
with
C⊤
ei
A= , b= ,
Im×m 0m×1
and the norm on Rn+m given by
⊤
α 2 α P 0 α
|| || = .
β β 0 Q β
Indeed, applying our “Magic Formula” gives
−1
P 0 P 0
ki⊤ =
b A ⊤
A A ⊤
b
0 Q 0 Q
= (CP C ⊤ + Q)−1 CP ei .
Therefore
b ⊤ = (CP C ⊤ + Q)−1 CP
K
and
b = P C ⊤ (CP C ⊤ + Q)−1 ,
K
the result from lecture.
The answers to the questions are therefore:
(a) False. This was given in lecture.
10 The columns of A are linearly independent because the lower block is the identity matrix.
114
(b) True. This was a bit messy in terms of matrix algebra, but otherwise straightforward.
(d) True. If the answer were not given so that all you had to do was check it, then, yes, it would have been pretty tough to see it.
As given, it was not so bad. Nevertheless, the solution developed in lecture, applying the Projection Theorem to a vector space
of random variables, seems far superior to the method used here: (1) the method in lecture allowed a more general solution,
namely CP C ⊤ + Q ≻ 0 instead of P ≻ 0 and Q ≻ 0; (2) the Projection Theorem in lecture resulted in easier computations,
and each step was well motivated through the calculation of the Gram matrix.
115
116
Chapter 6
Learning Objectives
• In Robotics Engineering, we are constantly faced with problems for which no closed form solution is possible. For such
problems, we seek algorithms that compute solutions. In many cases, these algorithms do not terminate with an exact answer
in a finite number of steps. Hence, we need to understand how to formulate and analyze the convergence of iterative processes.
• Another class of problems involve maximizing or minimizing real valued functions. It is important to understand sufficient
conditions for extrema of real valued functions to exist.
Outcomes
• Understand the concepts of open and closed sets in normed spaces.
• Understand sequences and how to analyse if they have limits.
• Characterize open and closed sets in terms of distance (ie., using the norm directly) and in terms of convergent sequences.
• Understand in a rigorous manner what is a continuous function. While we saw the definition in Calculus, for most of us, it did
not stick!
• Learn an interesting topological property called compactness and how it relates to the existence of extrema of functions.
117
6.1 Open and Closed Sets in Normed Spaces
We start with a few preliminaries. Let (X , R, ∥ • ∥) be a real normed space. Recall from Chapter 3.1 that a norm is a function
∥ • ∥ : X → [0, +∞) satisfying
(a) ∥x∥ ≥ 0 and ∥x∥ = 0 ⇐⇒ x = 0
(b) ∥α · x∥ = |α| · ∥x∥ for all α ∈ R, x ∈ X
(c) ∥x + y∥ ≤ ∥x∥ + ∥y∥ for all x, y ∈ X .
In addition, recall that
(d) For x, y ∈ X , d(x, y) := ∥x − y∥.
(e) For x ∈ X , S ⊂ X a subset, d(x, S) := inf y∈S ∥x − y∥.
(f) A ⊂ B ⇐⇒ A ∩ (∼ B) = ∅.
Definition 6.1 Let x0 ∈ X and a ∈ R, a > 0. The open ball of radius a center at x0 is
p
Figure 6.1: R2 , ∥ • ∥ with various norms: (a) Euclidean norm ∥(x1 , x2 )∥2 = |x1 |2 + |x2 |2 ; (b) One norm ∥(x1 , x2 )∥1 =
|x1 | + |x2 |; and (c) Max norm ∥(x1 , x2 )∥∞ = max1≤i≤2 |xi |.
118
Lemma 6.2 (Characterization of distance zero and greater than zero) Let (X , ∥ • ∥) be a normed space, x ∈ X , and S ⊂ X . Then,
Figure 6.2: Why (d(x, S) > 0) ⇐⇒ (∃ ϵ > 0 such that Bϵ (x) ⊂ (∼ S)) ⇐⇒ (∃ ϵ > 0 such that Bϵ (x) ∩ S = ∅). We simply
take ϵ = d(x,S)
2 > 0.
Remark 6.4
P̊ = {p ∈ P | ∃ϵ > 0, Bϵ (p) ⊂ P }
= {p ∈ P | d(p, ∼ P ) > 0}
= {x ∈ X | d(x, ∼ P ) > 0}
Remark 6.6 Hence, P is open if, and only if, P = {x ∈ X | d(x, ∼ P ) > 0}.
119
• Is P = (0, 1) ⊂ (R, ∥ • ∥) open? We check it a second way. ∼ P = (−∞, 0] ∪ [1, ∞). We have x ∈ P ⇐⇒ 0 < x < 1.
Hence,
Fact 6.11 The rational numbers Q ⊂ R are neither closed nor open. Q = R.
Theorem 6.12 (Characterization of Open and Closed Sets using Distance) Let (X , ∥ • ∥) be a normed space and P ⊂ X a subset.
Then P is open if, and only if, ∼ P is closed.
P is closed ⇐⇒ ∼ P is open.
P is open ⇐⇒ ∼ P is closed.
∼ P =∼ (P̊ ) = {x ∈ X | d(x, ∼ P ) = 0} = ∼
| P {z
=∼ P}
| {z }
P is open ∼P is closed
P = P̊ ⇐⇒ P = {x ∈ X | d(x, ∼ P ) > 0}
⇐⇒ P = {x ∈ X | ∃ϵ > 0, Bϵ (x) ∩ P = ∅}
⇐⇒ P =∼ {x ∈ X | ∀ϵ > 0, Bϵ (x) ∩ P ̸= ∅}
⇐⇒ ∼ P = {x ∈ X | ∀ϵ > 0, Bϵ (x) ∩ P ̸= ∅}
⇐⇒ ∼ P = {x ∈ X | d(x, ∼ P ) = 0}
⇐⇒ ∼ P = ∼ P
120
(a) An arbitrary union of open sets is open.
(b) A arbitrary intersection of closed sets is closed.
(c) A finite intersection of open sets is open.
(d) A finite union of closed sets is closed.
Example 6.15 Consider the real numbers as a normed space; that is, X = R and define the norm to be ||x|| = |x|, the standard
absolute value. Is the infinite intersection of open sets
∞
\ 1
(−1 − , 1)
n=1
n
open?
∞
1 1
T
Solution: No. Indeed, ∀ n ≥ 1, [−1, 1) ⊂ (−1 − n , 1). Hence, by definition of the intersection, [−1, 1) ⊂ (−1 − n , 1).
n=1
Moreover,
∞
\ 1
[−1, 1) = (−1 − , 1),
n=1
n
1 1
because if x < −1, then there exits 1 ≤ K < ∞ such that x < −1 − K, which implies that x ̸∈ (−1 − K , 1), and thus
∞
\ 1
x ̸∈ (−1 − , 1).
n=1
n
Consider a function f : Rn → Rn that is continuously differentiable and suppose we seek a root f (x∗ ) = 0. Note that the domain
and range are both Rn and thus this is the nonlinear equivalent of solving a square linear equation Ax − b = 0.
The Newton-Raphson Algorithm is a vector version of Newton’s Algorithm; see Chapter 11 of the ROB 101 Textbook. Let xk ∈ Rn
be our current approximation of a root of the function f . We write the linear approximation of f about the point xk as
∂f (xk )
f (x) ≈ f (xk ) + · (x − xk ). (6.1)
∂x
We want to chose xk+1 so that f (xk+1 ) = 0, but we cannot do that exactly. Based on the linear approximation in (6.1), we have that
∂f (xk )
f (xk+1 ) ≈ 0 ⇐⇒ 0 ≈ f (xk ) + · (xk+1 − xk ). (6.2)
∂x
∂f (xk )
If det ∂x ̸= 0, we can solve for xk+1 , giving us the standard form of the Newton-Raphson Algorithm,
−1
∂f (xk )
xk+1 = xk − f (xk ).
∂x
121
The “hope” is that each iteration of the algorithm produces a better approximation xk to a root of the function f (x), where by a better
approximation, we mean that if x∗ is an unknown root of f , then in the limit as k gets sufficiently large, we can make ∥x∗ − xk ∥
arbitrarily small. We formalize these ideas in Chapter 6.3 with the notion of a converging sequence and in Chapter 6.5 on Contraction
Mappings. For now, we’ll simply see a numerical illustration of these ideas in action.
Based on (6.2), we definea xk+1 := xk + ∆xk , where ∆xk is our update to xk . We can then break the algorithm into two
steps,
∂f (xk )
∆xk = −f (xk ) (solve for ∆xk ) (6.3)
∂x
xk+1 = xk + ∆xk (use ∆xk to update our estimate of the root). (6.4)
While for toy problems, we can use the matrix inverse to solve (6.3) for ∆xk , for larger problems, we recommend using an
LU Factorization or a QR Factorization. Once (6.3) has been solved, xk+1 is updated in (6.4) and the process repeats.
for some ϵ > 0. The validity of the Newton-Raphson Algorithm rests upon:
Example 6.19 Find a root of F : R4 → R4 near x0 = −2.0 3.0 π −1.0 for
x1 + 2x2 − x1 (x1 + 4x2 ) − x2 (4x1 + 10x2 ) + 3
3x1 + 4x2 − x1 (x1 + 4x2 ) − x2 (4x1 + 10x2 ) + 4
F (x) = .
7
0.5 cos(x1 ) + x3 − (sin(x3 ))
−2(x2 )2 sin(x1 ) + (x4 )3
Solution: We programmed up (6.3) and (6.4) in Julia and used a symmetric difference approximation for the derivatives, with
h = 0.1. Below are the first five results from the algorithm:
k=0 k=1 k=2 k=3 k=4 k=5
−2.0000 −3.0435 −2.4233 −2.2702 −2.2596 −2.2596
xk = 3.0000
2.5435 1.9233 1.7702 1.7596 1.7596
3.1416 0.6817 0.4104 0.3251 0.3181 0.3181
−1.0000 −1.8580 −2.0710 −1.7652 −1.6884 −1.6846
and
k=0 k=1 k=2 k=3 k=4 k=5
−39.0000 −6.9839 −1.1539 −0.0703 −0.0003 −0.0000
f (xk ) = −36.0000 .
−6.9839 −1.1539 −0.0703 −0.0003 −0.0000
2.9335 0.1447 0.0323 0.0028 0.0000 −0.0000
15.3674 −5.1471 −4.0134 −0.7044 −0.0321 −0.0001
By iteration five, we have a good approximation of a root because ||f (x5 )|| ≈ 10−4 . To emphasize that as xk evolves, so does the
122
Jacobian of f at xk , we provide the Jacobians at the initial and final steps,
−19.0000 −42.0000 0.0000 0.0000 −8.5577 −15.1155 0.0000 0.0000
∂f (x0 ) −17.0000 −40.0000 0.0000 0.0000 ∂f (x 5 ) −6.5577 −13.1155 0.0000 0.0000
= and = .
∂x 0.4539 0.0000 1.0000 0.0000 ∂x 0.3854 0.0000 0.9910 0.0000
7.4782 10.9116 0.0000 3.0100 3.9296 5.4337 0.0000 8.5616
■
Hopefully, you now have in your mind that iteration is useful. We formalize the process of doing iterations through sequences of
vectors in normed spaces.
6.3 Sequences
Once again, we let (X , ∥ • ∥) be a normed space.
Definition 6.20 A set of vectors indexed by the non-negative integers is called a sequence. Common notion includes (xn ) or {xn }.
Definition 6.21 A sequence of vectors (xn ) converges to x ∈ X if, ∀ ϵ > 0, ∃N (ϵ) < ∞ such that, n≥N =⇒ ∥xn − x∥ < ϵ, i.e.,
n≥N → xn ∈Bϵ (x). One writes
lim xn = x or xn → x or xn −−−−→ x.
n→∞ n→∞
∥x∥ = ∥x − y + y∥
≤ ∥x − y∥ + ∥y∥
⇓
∥x∥ − ∥y∥ ≤ ∥x − y∥
2. Applying the definition of convergence of a sequence, we set ϵ = 1 and deduce ∃N (1) < ∞ such that n ≥ N =⇒ ∥xn −x∥ ≤
1. Hence, ∀n ≥ N, ∥xn ∥ = ∥xn − x + x∥ ≤ ∥xn − x∥ + ∥x∥ ≤ 1 + ∥x∥. It follows that
3. ∥x − y∥ = ∥x − xn + xn − y∥ ≤ ∥x − xn ∥ + ∥xn − y∥ −−−−→ 0 =⇒ x = y.
n→∞
123
2. If x ∈ P is not a limit point of P , then x is called an isolated point of P .
Proposition 6.25 (Characterization of Isolated Points) x is an isolated point of P if, and only if, there exists ϵ > 0 such that
Bϵ (x) ∩ P = {x}.
Proof: Suppose there exists ϵ > 0 such that Bϵ (x) ∩ P = {x}, and let (xn ) be such that for all n ≥ 1, xn ∈ P and xn ̸= x.
Then, for all n ≥ 1, d(xn , x) ≥ ϵ and hence xn ̸→ x. For the other direction, we suppose that ∀ϵ > 0, Bϵ (x) ∩ P ̸= {x}. Since
x ∈ Bϵ (x) ∩ P , we deduce that for all ϵ = 1/n, there exists xn ̸= x and xn ∈ Bϵ (x) ∩ P . Hence x satisfies all conditions of a limit
point, namely xn ∈ P , xn ̸= x, and limn→∞ xn = x. ■
Remark 6.26 Letx be an isolated point of P . Then x is an element of P and the trivial sequence (xn = x) → x. Moreover, if (xn )
is a sequence of elements in P and limn→∞ xn = x, then there exists N < ∞ such that, for all n ≥ N , xn = x. In other words, the
“tail” of the sequence is a trivial sequence.
Proposition 6.27 (Characterization of Set Closure using Limit Points and Isolated Points) Let Piso be the collection of all isolated
points of P and let P∞ be the collection of all limit points of P . Then
P = Piso ∪ P∞ .
Proof:
1. Suppose x is a limit point or an isolated point. Then, ∃(xn ) such that xn ∈ P and xn → x. Because xn → x, ∀ϵ > 0, ∃xn ∈ P
such that ∥xn − x∥ < ϵ, which implies that d(x, P ) = 0. Hence x ∈ P .
2. Suppose x ∈ P . Then, d(x, P ) = 0 and hence, for all n ≥ 1, B n1 (x) ∩ P ̸= ∅. Two cases are possible. For some N < ∞,
and all N ≥ N , B n1 (x) ∩ P = {x}, in which case, x ∈ Piso . Otherwise, for n ≥ 1, there exists xn ∈ B n1 (x) ∩ P such that
xn ̸= x, in which case, the sequence (xn ) establishes that x ∈ P∞ .
Corollary 6.28 P is closed if, and only if, it contains its limit points, that is,
P = P ⇐⇒ P∞ ⊂ P.
Proof: By definition, Piso ⊂ P . Hence, if P∞ ⊂ P , then Piso ∪ P∞ ⊂ P , which implies by Proposition 6.27 that P ⊂ P , and hence
P is closed. For the other direction, if P is closed, then Proposition 6.27 implies that P∞ ⊂ P , and hence the proof is done. ■.
Definition 6.29 A sequence (xn ) is Cauchy if ∀ ϵ > 0 ∃N (ϵ) < ∞, such that n ≥ N and m ≥ N =⇒ ∥xn − xm ∥ < ϵ.
The following captures the idea of terms in a sequence getting closer and closer together. The analysis will prepare us for the
Contraction Mapping Theorem.
Lemma 6.31 Let 0 ≤ c < 1 and let (an ) be a sequence of real numbers satisfying, ∀ n ≥ 1,
Proof:
Pf: First observe that |a3 − a2 | ≤ c|a2 − a1 | ≤ c2 |a1 − a0 | and then complete the proof by induction.
124
□
cn
Step 2: ∀ n ≥ 1, k ≥ 1, |an+k − an | ≤ 1−c |a1 − a0 |.
Pf:
Pf: Consider m and n and without loss of generality, suppose m ≥ n. If m = n, then |am − an | = 0. Thus, we can assume
m = n + k for some k ≥ 1. Then
cn 0
|am − an | = |an+k − an | ≤ |a1 − a0 | −−−−−−−−→,
1−c n→∞, m≥n
∥xn − xm ∥ = ∥xn − x + x − xm ∥
≤ ∥xn − x∥ + ∥x − xm ∥
ϵ ϵ
< +
2 2
< ϵ for all n, m ≥ N
■
Unfortunately, not all Cauchy sequences are convergent. For a reason we will understand shortly, all counter examples are infinite
dimensional.
R1
Example 6.33 Let X := {f : [0, 1] → R | f is continuous} and equip it with the one-norm, ∥f ∥1 := 0 |f (τ )|dτ . Define a sequence
as follow, where each function piecewise is linear and n ≥ 2,
0
0 ≤ t ≤ 12 − n1
fn (t) = 1 + n(t − 2 ) 21 − n1 ≤ t ≤ 12
1
t ≥ 12 .
1
Show that the sequence is Cauchy and does not have a limit in X .
Proof: You may want to check that each function is continuous at the breakpoints, 21 − n1 and 21 . Moreover, by using the area under
a triangle, you can show ∥fn − fm ∥1 = 12 | n1 − m
1
| −−−−−−→ 0, and thus the sequence is Cauchy.
n, m→∞
125
Figure 6.3: Visually, the sequence of continuous functions (fn (t))∞ n=2 appears to be converging to a step function, which is discon-
tinuous. If that is true, we then have a Cauchy sequence in (X , ∥ • ∥1 ) that does not have a limit in the set X .
How can we show that there does not exist any (continuous) function f ∈ X to which the sequence converges? We define
(
0 0 ≤ t < 21
fstep (t) :=
1 12 ≤ t ≤ 1
and note that fstep is discontinuous and hence fstep ̸∈ X . Define Y := span{X , fstep } ⊂ {f : [0, 1] : R}, the vector space of
all functions from the interval [0, 1] to the real numbers. You can check that ∥ • ∥1 is also a norm on Y and observe that f ∈ Y is
continuous if, and only if, f ∈ X . Finally, you can easily compute that
1
∥fn − fstep ∥1 = ,
2n
and hence, fn → fstep . By uniqueness of limits, there does not exist a continuous function in Y to which the sequence converges,
and thus the sequence does not have a limit in X .
■
The study of Cauchy sequences led mathematicians to wonder if it is possible to find normed spaces where all Cauchy sequences
do have limits (within the given normed space), and moreover, if a normed space was “deficient” in the sense that it had Cauchy
sequences without limit, could it be “enlarged” or “completed” to one where all Cauchy sequences do have limits. These are great
questions!
Definition 6.34 A normed space (X , R, ∥ · ∥) is complete if every Cauchy Sequence in X has a limit in X. Such spaces are also
called Banach spaces.
The above definition can only be useful if a list of useful Banach spaces is known; see https://en.wikipedia.org/wiki/
Banach_space#Examples_2. In EECS562, you will use (C [0, T ] , ∥ • ∥∞ ), the set of continuous functions with the infinity
(or max) norm. The sequence in Fig. 6.3 is not a Cauchy sequence in (C [0, T ] , ∥ • ∥∞ ); you might want to check that.
Fact 6.35 For a < b, both finite, (C [a, b], ∥ • ∥∞ ) is complete where C [a, b] = {f : [a, b] → R | f continuous}. We showed above
that (C[a, b], || • ||1 ) is not complete.
Definition 6.36 A subset S of a normed space is complete if every Cauchy Sequence in S has a limit in S.
126
Theorem 6.38 Let (X , || • ||) be a normed space. Then,
Fact 6.39 Every normed space (X , || • ||X ) has a “completion”. A bit loosely stated, this means there is a complete normed space
(Y, || • ||Y ) such that
(a) X ⊂ Y (X can naturally be viewed as a subset of Y. The precise definition involves isometric isomorphisms.)
(b) ∀x ∈ X , ||x||Y = ||x||X .
(c) X = Y (X is the closure of X in Y. Hence, X fits “tightly” into Y in that sense that for any point in y ∈ Y, d(y, X ) = 0).
(d) Y = X ∪ {limit points of Cauchy sequences in X }
You might ask about the completion of C[a, b] when the || • ||1 is used? It turns out to be the set of Lebesgue integrable functions on
[0, 1]. We alluded to Lebesgue back in Chapter 5.1.3.
Theorem 6.41 (Contraction Mapping Theorem) If T is a contraction mapping on a complete subset S of a normed space
(X , R, ∥ • ∥), then there exists a unique vector x∗ ∈ S such that T (x∗ ) = x∗ . Moreover, for every initial point x0 ∈ S, the
sequence xn+1 = T (xn ) , n ≥ 0, is Cauchy, and xn → x∗ .
Proof: Let (xn ) be defined as in the statement of the theorem. Because T is a contraction mapping, there exists 0 ≤ c < 1 such
that, for all n ≥ 1,
Claim 6.42 (xn ) is Cauchy and thus by the completeness of S, ∃x∗ ∈ S such that xn → x∗ .
Proof: We leave as an exercise to show by induction that ∥xn+1 − xn ∥ ≤ cn ∥x1 − x0 ∥. Next, consider ∥xm − xn ∥, and without loss
of generality, suppose m = n + k, k > 0. Then,
∥xm − xn ∥ = ∥xn+k − xn ∥
= ∥xn+k − xn+k−1 + xn+k−1 − · · · + xn+1 − xn ∥
≤ ∥xn+k − xn+k−1 ∥ + · · · + ∥xn+1 − xn ∥
≤ cn+k−1 + cn+k−2 + · · · + cn ∥x1 − x0 ∥
k−1
!
X
n i
=c c ∥x1 − x0 ∥
i=0
∞
!
X
n i
≤c c ∥x1 − x0 ∥
i=0
cn
= ∥x1 − x0 ∥ −
−−−→ 0
1−c n→∞
m>n
1
where we used a simple fact about the geometric series for 1−c . Therefore, (xn ) is Cauchy sequence in S, and by completeness,
∃x∗ ∈ S such that xn → x∗ . □
127
Proof: Let n ≥ 1 be arbitary. Then,
∥x∗ − y ∗ ∥ = ∥T (x∗ ) − T (y ∗ ) ∥
≤ c∥x∗ − y ∗ ∥.
The only non-negative real number γ that satisfies γ ≤ γc for some 0 ≤ <¸1 is γ = 0. Hence, due to the property of norms,
0 = ∥x∗ − y ∗ ∥ =⇒ x∗ = y ∗ .
□
■
Remark 6.45 The (local) convergence of the Newton-Raphson Algorithm is accomplished by identifying a closed ball in Rn on which
the function
−1
∂f
T (x) := x − ϵ (x) (f (x) − y)
∂x
is a contraction mapping. The estimate of a suitable value 0 ≤ c < 1 is based on a Lipschitz constant for the Jacobian. We check
that a solution of f (x) − y is a fixed point of T (x). Indeed,
x∗ = T (x∗ )
⇕
−1
∂f ∗
x∗ = x∗ − ϵ (x ) (f (x∗ ) − y)
∂x
⇕
−1
∂f ∗
0 = −ϵ (x ) (f (x∗ ) − y)
∂x
⇕
0 = (f (x∗ ) − y).
Figure 6.4: Let ϵ = 1.0 Then, ∀ δ > 0, ∃ x ∈ Bδ (x0 ) such that |f (x) − f (x0 )| ≥ ϵ. Indeed, x = x0 + δ2 works.
128
Definition 6.46 Let (X , ∥ • ∥), and (Y, ||| • |||) be normed spaces.
(a) f : X → Y is continuous at x0 ∈ X if ∀ϵ > 0, ∃δ (ϵ, x0 ) > 0 such that ∥x − x0 ∥ < δ =⇒ |||f (x)) ||| < ϵ.
(b) f is continuous if it is continuous at x0 for all x0 ∈ X .
Remark 6.47 It is also common to define continuity at a point as as ∀ ϵ > 0, ∃δ > 0 such that x ∈ Bδ (x0 ) =⇒ f (x) ∈
Bϵ (f (x0 )). Though less common, it can also be defined as ∀ ϵ > 0, ∃δ > 0 such that f (Bδ (x0 )) ⊂ Bϵ (f (x0 )).
Remark 6.48 (Discontinuous at a point) Let’s negate the definition of f continuous at x0 . f is discontinuous at x0 ∈ X if ∃ ϵ > 0
such that, ∀ δ > 0, ∃ x ∈ X such that ||x − x0 || < δ and |||f (x) − f (x0 )||| ≥ ϵ. This can also be stated as ∃ ϵ > 0 such that, ∀ δ > 0,
∃ x ∈ Bδ (x0 ) such that f (x) ̸∈ Bϵ (f (x0 )). Finally, it can also be stated as ∃ ϵ > 0, ∀δ > 0 such that f (Bδ (x0 )) ̸⊂ Bϵ (f (x0 )).
Theorem 6.49 (Characterization of Continuity at a Point via Sequences) Let Let (X , ∥ • ∥), and (Y, ||| • |||) be normed spaces and
f : X → Y a function.
(a) If f is continuous at x0 and the sequence (xn ) is a sequence in X that converges to x0 , then the sequence (yn := f (xn )) in Y
converges to f (x0 ). [yn := f (xn ), y0 := f (x0 ) implies yn → y0 when f is continuous at x0 . ]
(b) If f is discontinuous at x0 , then there exists a sequence (xn ) such that xn → x0 , and f (xn ) ̸→ f (x0 ), that is, f (xn ) does
not converge to f (x0 ).
The proof is done in HW 10. The main point is, just as sequences can be used to completely characterize closed sets, they can also
be used to completely characterize continuity at a point.
Corollary 6.50 f : X → Y is continuous at x0 if, and only if, every convergent sequence in X with limit x0 is mapped by f to a
convergent sequence in Y with limit f (x0 ). In other symbols, (f is continuous at x0 ) ⇐⇒ (xn → x0 =⇒ f (xn ) → f (x0 )).
Example 6.52 ni = 2i + 1 or ni = 2i .
Proof: The sequence (xn ) constructed above in Exercise 6.55 has no convergent subsequence. Indeed, by Remark 6.23, if (xni ) is a
subsequence of (xn ), then ∥xni − xnj ∥ ≥ | ∥xni ∥ − ∥xnj ∥ | ≥ |ni − nj |, and thus is not Cauchy. Because it is not Cauchy, it cannot
be convergent.
■
Definition 6.57 (Equivalent Norms) Let (X , R) be a vector space. Two norms || · || : X → [0, ∞) and ||| · ||| : X → [0, ∞) are
equivalent if there exist positive constants K1 and K2 such that, for all x ∈ X ,
129
-
We’ll next develop a few bounds for norms on finite dimensional vector spaces, and use the bounds to relate convergence of a
sequence of vectors in a finite dimensional normed space (X , || • ||) to the convergence of the representation of the sequence with
respect to a basis. Let {v} := {v 1 , v 2 , . . . , v n } be a basis for X and define Mi := span{v j |j ̸= i}, the (n−1)-dimensional subspace
spanned by all the basis vectors except the i-th one. Because Mi is finite dimensional, it is a complete and hence a closed subset of
X . By construction, v i ̸∈ Mi , and thus,
δi := d(v i , Mi ) > 0.
Lemma 6.59 Let {v} = {v 1 , v 2 , . . . , v n }, {M1 , M2 , . . . , Mn } and {δ1 , δ2 , . . . , δn } be as above. Then for any vector x = α1 v 1 +
α2 v 2 + · · · + αn v n ∈ X , !
Xn
∗ ∗
κ∗ max |αi | ≤ ||x|| ≤ κ |αi | ≤ nκ max |αi | , (6.6)
1≤i≤n 1≤i≤n
i=1
where κ∗ := min1≤i≤n {δi } > 0, κ∗ := max1≤i≤n {||v i ||} < ∞, and α := [x]{v} .
Proof: The right hand side of (6.6) follows from the triangle inequality for norms. The proof of the left hand side of (6.6) requires a
few steps that we carefully enumerate and leave to the reader:
1. If αi = 0, then d(αi v i , Mi ) = 0.
Corollary 6.60 (Equivalent Norms) All norms on finite dimensional vector spaces are equivalent1 .
Proof: Equation (6.6) shows that, once a basis is chosen, any norm on an n-dimensional normed space is equivalent to (Rn , ||•||max ).
We leave it to the reader to show that this is enough to prove the result. ■
Corollary 6.61 (Convergence of Components of Sequences in Finite-dimensional Spaces) Let (xk ) be a sequence in a finite di-
mensional normed space (X , || • ||) with basis {v 1 , v 2 , . . . , v n }. Let
αk := [xk ]{v} ∈ Rn
be the representation of xk with respect to the basis {v} so that xk = αk,1 v 1 + αk,2 v 2 + · · · + αk,n v n . Then (xk ) is a Cauchy
sequence in X if, and only if, each sequence (αk,i ) is a Cauchy sequence in (R, | • |), 1 ≤ i ≤ n.
Proof: Equation (6.6) reduces the proof to understanding that a sequence in Rn is Cauchy if, and only if, each of its real components
is Cauchy. But with the max-norm, (Rn , || • ||max ), this is quite easy to show because it selects the largest element, and if the largest
element is getting small, then the rest of them are too. Have fun with it! ■
Theorem 6.62 (Bolzano-Weierstrass Theorem or the Sequential Compactness Theorem) In a finite dimensional normed space
(X , || • ||), the following two properties are equivalent for a set C ⊂ X .
(b) Every sequence in C contains a convergent subsequence, that is, for every sequence (xn ) in C (i.e. xn ∈ C, ∀ n ≥ 1), there
exists x0 ∈ C and a subsequence (xni ) of (xn ) such that xni → x0 .
1 It is an interesting fact that a vector space is finite dimensional if, and only if, all norms on it are equivalent.
130
Figure 6.5: Illustration of a sequntial comactness proof from Wikipedia https://en.wikipedia.org/wiki/
Bolzano-Weierstrass_theorem.
131
Proof: We first show ∼ (a) =⇒ ∼ (b). There are two cases, C is not closed or C is not bounded.
Suppose that C is not closed. Then by Corollary 6.28, there exists a limit point x0 ∈ C such that x0 ̸∈ C. Hence, there exists a
sequence (xn ) with xn ∈ C and xn → x0 ̸∈ C. By Lemma 6.53, all subsequences (xni ) of (xn ) satisfy xni → x0 . Hence, we have
constructed a sequence of elements of C for which there is no subsequence with a limit in C.
Suppose next that C is unbounded. Then Exercise 6.55 produces a sequence of elements of C for which every subsequence is not
Cauchy, and hence cannot have a limit in C. This completes the proof of ∼ (a) =⇒ ∼ (b). Nothing we have done so far depends on
X being finite dimensional.
We now turn to (a) =⇒ (b). Let (xn ) be an arbitrary sequence built from elements of C. To show: it has a convergent subsequence
with limit in C.
Case 1: (xn ) has only a finite number of distinct elements. Hence, at least one value is repeated an infinite number of times; let’s call it
xN ∈ C. Because it is repeated an infinite number of times, we can choose a strictly increasing sequence n1 < n2 < · · · < ni < · · ·
such that xni = xN for all i ≥ 1. The subsequence (xni ) converges to xN ∈ C and hence we are done.
Case 2: (xn ) has an infinite number of distinct elements. Here, we will invoke that X is finite dimensional. Indeed, by Corollary 6.61,
a subsequence of (xn ) will be convergent if, and only if, once it it is represented with respect to a basis, each of its components is
convergent. Hence, it is enough to prove that every real sequence (an ), with an infinite number of distinct elements, contained in a
closed and bounded subset of C1 ⊂ R has a convergent subsequence. Every bounded subset of R is contained within a closed interval
of the form [−N, N ], for some integer 1 ≥ N < ∞, and hence we reduce ourselves to a set C1 ⊂ [−N, N ] and C1 contains an infi-
nite number of (distinct) elements of the sequence (an ). Because C1 contains an infinite number of distinct elements of (an ), some
closed interval of the form [n, n + 1], for |n| ≤ N must contain an infinite number of elements of (an ). Without loss of generality,
we will assume the closed interval [0, 1] contains an infinite number of elements of (an ); the reader will see that our argument works
mutatis mutandis for any other interval of length one.
Divide the interval [0, 1] into two halves, [0, 1] = [0, 1/2] ∪ [1/2, 1]. At least one of the halves must contain an infinite number of
elements of (an ). For the sake of argument, assume it is the right half. Now we divide in half [1/2, 1] = [1/2, 3/4] ∪ [3/4, 1] and note
that at least one of the halves must contain an infinite number of elements of the sequence (an ). For the sake of argument, assume
the left half this time. We then divide in half [1/2, 3/4] = [1/2, 5/8] ∪ [5/8, 3/4] and note that at least one of the halves must contain
an infinite number of elements of the sequence (an ).
At the k-th step, we have a closed sub-interval of Ik ⊂ [0, 1], with length Ik equalling 1/2k and Ik contains an infinite number of
distinct elements of (ak ). Hence, for all k ≥ 1, there exists nk such that nk > nk−1 and ank ∈ Ik . Moreover, the sequence (ank ) is
Cauchy because for all i ≥ k, j ≥ k, |ani − anj | ≤ 1/2k . Hence the subsequence (ank ) converges to a limit in C1 and the proof is
done.
■
Definition 6.63 A set C satisfying (a) or (b) of Theorem 6.62 is said to be compact.
Remark 6.64 There are various definitions of compactness that are appropriate for specific settings. The one above is typically
called sequential compactness. Because it is the only form of compactness we use in these notes, we will drop the term sequential
and simply call such sets compact sets.
Theorem 6.65 (Weierstrass Theorem) If C is compact and f : C → R is continuous, then f achieves its extreme values. That is,
and
∃x∗ ∈ C, s.t. f (x∗ ) = inf f (x).
x∈C
132
Proof: We do a proof by contradiction and suppose not, that is, suppose f ∗ = ∞. Then, by definition of the supremum, for all n ≥ 1,
there exists xn ∈ C such that f (xn ) ≥ n. Because C is compact, there exists x0 ∈ C and a subsequence (xni ) of (xn ) such that
xni → x0 . Because f is assumed continuous,
But because f : C → R, f (x0 ) ∈ R, which contradicts f (x0 ) = ∞. Hence, it cannot be the case that f ∗ = ∞.
□
Continuing with the proof, we now know that f ∗ is finite. Once again, applying the definition of the supremum, we have that
∀n > 0, ∃ xn ∈ C such that |f ∗ − f (xn )| < 1/n. Because C is compact, there exists x∗ ∈ C and a subsequence (xni ) such that
xni → x∗ . Because f continuous, f (xni ) → f (x∗ ). We expect to show that f (x∗ ) = f ∗ . To do so, we note that, by the continuity2
of f ,
1
|f ∗ − f (x∗ )| = lim |f ∗ − f (xni )| ≤ lim = 0.
i→∞ i→∞ ni
2 Technically, we are using the continuity of f implies that of g(x) := |f ∗ − f (x)|, which you are welcome to check. Otherwise, you can do the estimate as
|f ∗ − f (x∗ )| = |f ∗ − f (xni ) + f (xni ) − f (x∗ )| ≤ |f ∗ − f (xni )| + |f (xni ) − f (x∗ )| ≤ n1 + |f (xni ) − f (x∗ )| −−−−→ 0.
i i→∞
133
134
Chapter 7
Learning Objectives
• Learn about “convexity”, which defines a class of problems where local minima are also global minima
• Learn about two types of convex optimization problems that are fast enough to run in real time on many mobile platforms.
Outcomes
• Definition of convex sets and convex functions.
• Local minima of convex functions are also global minima.
• A Quadratic Program (QP) is an optimization problem with a quadratic cost and linear inequality and equality constraints.
• A Linear Program (LP) is an optimization problem with a linear cost and linear inequality and equality constraints.
• Through the concept of “slack variables”, an optimization problem with either a one-norm or ∞-norm as the the cost and linear
inequality and equality constraints can be turned into an LP.
135
7.1 Brief Remarks on Convex Sets and Convex Functions
Definition 7.1 Let (X , R) be a real vector space. C ⊂ V is convex if ∀x, y ∈ C and 0 ≤ λ ≤ 1, the convex combination
λx + (1 − λ)y ∈ C.
Remark 7.2
(a) For C to be convex, given any two points x, y ∈ C, then the line connecting x and y must also lie in C.
(b) Open and closed balls arsing from norms are always convex.
the set of all convex combinations of elements of S. It can also be defined as the smallest convex set that contains S.
136
Remark 7.5 For a function to be convex, the line λx + (1 − λ)y connecting x and y must lie at or above the graph of the function;
it can never go below.
Theorem 7.7 (Local equals Global for Convex Functions) If D and f are both convex, then any local minimum is also a global
minimum.
Proof: We show that if x is not a global minimum, then it cannot be a local minimum. Specifically, we prove the contrapositive
statement: ((a) =⇒ (b)) ⇐⇒ (¬(b) =⇒ ¬(a)), where
(a) x ∈ D is a local minimum
(b) x ∈ D is a global minimum.
¬(b) ∃y ∈ D such that f (y) < f (x).
¬(a) ∀ δ > 0, ∃ z ∈ Bδ (x) ∩ D such that f (z) < f (x).
Claim 7.8 If f (y) < f (x), then ∀0 < λ ≤ 1, the vector z := (1 − λ)x + λy satisfies f (z) < f (x).
Proof:
Proof:
The two Claims establish ¬(b) =⇒ ¬(a), and hence the proof is complete.
■
Fact 7.10
• All norms ∥ • ∥ : X → [0, ∞) are convex.
• For all 1 ≤ β < ∞, ∥ • ∥β is convex. Hence, on Rn , ∀ 1 ≤ p < ∞, Σni=1 |xi |p is convex.
• Let r > 0, ∥ • ∥ a norm, Br (x0 ) is a convex set; special case: B1 (0) convex set. (unit ball about the origin).
• Let C be an open, bounded and convex set, 0 ∈ C. Then, ∃ ∥ • ∥ : X → [0, ∞) such that C = {x ∈ X | ||x|| < 1} = B1 (0).
In other words, open unit balls are characterized by the fact that they are open, bounded, convex sets that contain the origin.
• K1 and K2 are convex, then K1 ∩ K2 is convex. (by convention, the empty set is convex)
• Consider (Rn , R) and let A be a real m × n matrix and b ∈ Rm . Then,
137
– K1 = {x ∈ Rn | Ax ≤ b} is also convex (linear inequality with Ax ≤ b interpreted row wise).
– K2 = {x ∈ Rn | Ax = b} is convex (linear equality).
– K3 = {x ∈ Rn | Aeq x = beq , Ain x ≤ bin } is convex as well by the intersection property.
e ≥ eb ⇐⇒ (−A)x
Remark 7.11 Ax e ≤ (−eb).
Fact 7.12 (Not an Easy one to Prove) Suppose (X , R, || • ||) is a finite dimensional normed space, C ⊂ X is convex, and f : C → R
is convex. Then f is continuous on C̊.
Remark 7.13 f can have jumps on the boundary of C, that is, on ∂C := C ∩ (∼ C) = C \ C̊ := {x ∈ C | x ̸∈ C̊}.
for the value of x ∈ S that achieves the minimum of f over all possible elements of S; that is, f (x∗ ) = minx∈S f (x). Now, the
problem is, you should only write something like (7.1) when
2. it is unique.
to indicate that x∗ is one of a set of values that all minimize the function f over the set S. It is correct notation, but not commonly
used. If you are not sure that a minimum exists, then definitely you should not use (7.1). Even worse, something you should never
do is write
x∗ = arg inf f (x) (7.3)
x∈S
because it makes no sense! By the very definition of an infimum, there may be no value in S achieving the infimum.
with Q a positive definite matrix, while Theorem 3.51 and Proposition 3.91 dealt with minimum norm squared solutions of underde-
termined systems of linear equations
b := arg min x⊤ Qx.
x
Ax=b
These are both quadratic optimization problems that admit closed-form solutions.
A Quadratic Program is a more general kind of quadratic optimization problem with constraints. The cost to be minimized in (7.1)
is quadratic plus a linear term, meaning that f : Rm → R has the form
1 ⊤
f (x) = x Qx + qx, (7.4)
2
where Q is an m × m symmetric, positive semi-definite matrix, and q is a 1 × m row vector. Moreover, instead of optimizing over all
of Rm as in Example 3.43 and Proposition 3.95, or only over linear equality constrains, as in Theorem 3.51 and Proposition 3.91, we
138
are allowed to seek solutions that lie in a subset of Rm defined by linear inequality and linear equality constraints that are typically
written in the form
Recall that the symbol ⪯ is a way to define “less than or equal to” for vectors; it means that each component of the vector on the left
hand side is less than or equal to the corresponding component of the vector on the right hand side. As an example
3 4
2 ⪯ 3 ,
4 4
though
3 1
2 ̸⪯ 3 ;
4 4
and
3 1 x1 0
⪯ ,
2 4 x2 9
means that x1 and x2 must satisfy
3x1 + x2 ≤ 0
2x1 + 4x2 ≤ 9.
What if you really wanted 2x1 + 4x2 ≥ 9? Then you need to remember that when you multiply both sides by a minus sign, the
inequality sign flips. Hence,
3x1 + x2 ≤ 0 3x1 + x2 ≤ 0
3 1 x1 0
⇐⇒ ⇐⇒ ⪯ .
2x1 + 4x2 ≥ 9 −2x1 − 4x2 ≤ −9 −2 −4 x2 −9
In addition, most QP solvers allow one to specify lower and upper bounds on x of the form
lb ⪯ x ⪯ ub. (7.7)
While such constraints could always be rewritten in the form of (7.5), using (7.7) is more convenient, intuitive, and less error prone.
The inclusion of constraints allows for very interesting and practical optimization problems to be posed.
We consider the QP
1 ⊤
x∗ = arg min x Qx + qx (7.8)
2
x ∈ Rm
Ain x ⪯ bin
Aeq x = beq
lb ⪯ x ⪯ ub
and assume that Q is symmetric (Q⊤ = Q) and positive definite (x ̸= 0 =⇒ x⊤ Qx > 0), and that the subset of Rm defined
by the constraints is non empty, that is
139
• https://web.stanford.edu/~boyd/papers/pdf/osqp.pdf
• https://stanford.edu/~boyd/software.html
• https://www.mathworks.com/help/optim/ug/quadprog.html
• https://www.mathworks.com/help/mpc/ug/qp-solver.html
How do QPs arise in Robotics? Here is one example. Consider the robot equations,
where q ∈ Rn , u ∈ Rm . The ground reaction forces for a bipedal robot can be expressed as
Fh
F = Λ0 (q, q̇) + Λ1 (q)u = .
Fv
Suppose the commanded torque is u = γ(q, q̇) and we need to respect bounds on the ground reaction forces, such as
F v ≥ 0.2mtotal g,
which means the normal force should be at least 20% of the total weight of the robot, and
|F h | ≤ 0.6F v ,
which places the horizontal component of the ground reaction force in a friction cone with magnitude less than 60% of the total
vertical force. Putting it all together:
Fv
≥ 0.2mtotal g
Fh ≤ 0.6F v ⇐⇒ Ain (q)u ≤ bin (q, q̇).
−F h ≤ 0.6F v
QP:
u∗ = argmin u⊤ u + p d⊤ d
where d is a relaxation parameter that allows the torque to deviate from its desired value in order to respect the constraints on the
ground reaction forces. The parameter p is a scalar weight term.
7.4 What is a Linear Program and How can it be used to Minimize ||•||1 and ||•||max ?
Definition 7.14 A Linear Program means to minimize a scalar-valued linear function subject to linear equality and inequality
constraints. For x ∈ Rn , and f ∈ Rn
minimize f ⊤ x
subject to Ain x ⪯ bin
Aeq x = beq
where Ain x ⪯ bin means each row of Ain x is less than or equal to the corresponding row of bin . The only restrictions on Ain and
Aeq are that the set
K = {x ∈ Rn | Ain x ⪯ bin , Aeq x = beq }
should be non-empty.
140
Remarkably, through a concept called slack variables, Linear Programs can be applied to cost functions that include the one-norm
and the max-norm.
Pn
Linear Program for ℓ1 -norm: ||x||1 = i=1 |xi |
Suppose that A is an m × n real matrix. Minimize ||Ax − b||1 is equivalent to the following linear program on Rn+m
minimize f ⊤ X
(7.10)
subject to Ain X ⪯ bin
x
with X = (s ∈ Rm are called slack variables)
s
A −Im×m b
f := 01×n 11×m , Ain := and bin :=
−A −Im×m −b
Let’s see if we can understand why the above is true. Writing out the terms, equation (7.10) becomes
m
X
minimize si
x,s
i=1
subject to Ax − s ⪯ b
− Ax − s ⪯ −b.
This is equivalent to
m
X
minimize si
x,s
i=1
subject to − s ⪯ b − Ax
− s ⪯ −(b − Ax),
which is equivalent to
m
X
minimize si
x,s
i=1
subject to − s ⪯ b − Ax
+ s ⪰ b − Ax,
which is equivalent to
m
X
minimize si
x,s
i=1
subject to − s ⪯ b − Ax ⪯ s.
Because for real numbers si , yi , the inequality (−si ≤ yi ≤ si ) ⇐⇒ (0 ≤ |yi | ≤ si ), we end up with
m
X
minimize si
x,s
i=1
subject to 0 ≤ |b − Ax|i ≤ si ,
141
which is equivalent to
m
X
minimize |b − Ax|i .
x
i=1
Whoever thought this up was pretty clever! It reduces a nonlinear problem to a linear problem. The max-norm is a bit simpler,
requiring only a single slack variable.
Suppose that A is an m × n real matrix. Minimize ||Ax − b||∞ is equivalent to the following linear program on Rn+1
minimize f ⊤ X
subject to Ain X ⪯ bin
x
with X = (s ∈ R is called a slack variable)
s
A −1m×1 b
f := 01×n 1 , Ain := and bin :=
−A −1m×1 −b
142