
Iterative methods for solving least-squares

When A has full column rank, our least-squares estimate is


x̂ = (A^T A)^{-1} A^T y.

If A is M × N, then constructing A^T A costs O(MN^2) computations, and solving the N × N system A^T A x = A^T y costs O(N^3) computations. (Note that for M ≥ N, the cost of constructing the matrix actually exceeds the cost to solve the system.)
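
To make the cost accounting concrete, here is a minimal numpy sketch of the direct normal-equations solve that the iterative methods in this section will replace. The sizes M and N below are illustrative only; the point is simply where the O(MN^2) and O(N^3) costs appear.

import numpy as np

# Illustrative sizes only; at the scale of the MRI example below,
# neither forming H nor solving with it would be feasible.
M, N = 2000, 500
A = np.random.randn(M, N)          # full column rank with probability one
y = np.random.randn(M)

H = A.T @ A                        # O(M N^2) work, N x N storage
b = A.T @ y                        # O(M N) work
x_hat = np.linalg.solve(H, b)      # O(N^3) work

# Sanity check against a library least-squares routine.
x_ref = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.max(np.abs(x_hat - x_ref)))   # ~0 up to roundoff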

This cost can be prohibitive for even moderately large M and N . But
inverse problems with large M and N are common in the modern
world. For example, a typical 3D MRI scan will try to reconstruct a
128 × 128 × 128 cube of voxels from about 5 million non-uniformly
spaced samples in the spatial Fourier domain. In this case, the matrix
A, which models the MRI machine, is M = 5 · 10^6 by N = 2.1 · 10^6.

In this case, MN^2 is huge (∼ 10^19); even storing the matrix A^T A in memory would require terabytes of RAM.

In this section, we will overview two iterative methods, steepest descent and conjugate gradients, which reformulate
A^T A x = A^T y
as an optimization program and then solve it by an iterative descent method. Each iteration is simple, requiring one application of A and one application of A^T.

If A^T A is well-conditioned, then these methods can converge in very few iterations (especially conjugate gradients). This makes the cost of solving a least-squares problem dramatically smaller: about the cost of a few hundred applications of A.

Moreover, we will not need to construct A^T A or even A explicitly. All we need is a "black box" which takes a vector x and returns Ax. This is especially useful if it takes far fewer than O(MN) operations to apply A or A^T.

In the MRI example above, it takes about one second to apply A^T A,
and the conjugate gradients method converges in about 50 iterations,
meaning that the problem can be solved in less than a minute. Also,
the storage requirement is on the order of O(M + N), rather than O(MN).
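
One way to realize this "black box" idea in code is sketched below, using scipy's LinearOperator together with its conjugate gradient solver. The functions apply_A and apply_AT are stand-ins for whatever fast routines are available (an FFT-based operator in the MRI example); here they wrap an explicit matrix only so the sketch is self-contained.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

M, N = 2000, 500
A = np.random.randn(M, N)            # placeholder for a fast operator
y = np.random.randn(M)

def apply_A(x):                      # black box: x -> A x
    return A @ x

def apply_AT(r):                     # black box: r -> A^T r
    return A.T @ r

# H = A^T A as an operator: we only ever apply it, never form or store it.
H = LinearOperator((N, N), matvec=lambda x: apply_AT(apply_A(x)))
b = apply_AT(y)

x_hat, info = cg(H, b)               # conjugate gradients; info == 0 means it converged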

Recasting sym+def linear equations as an optimization program
We want to solve
A^T A x = A^T y,
which we abbreviate as H x = b, with H = A^T A and b = A^T y.
Note that H is symmetric and positive definite (the latter being true
if A has full column rank). This means that the following quadratic
program is convex:
min_{x∈R^N} (1/2) x^T H x − x^T b.    (1)
The consequence of this is that a necessary and sufficient condition for x̂ to be the minimizer of (1) is that the gradient, evaluated at x̂, be zero:
∇_x ( (1/2) x^T H x − x^T b ) |_{x=x̂} = 0.

Since
∇_x ( (1/2) x^T H x − x^T b ) = (1/2) ∇_x(x^T H x) − ∇_x(x^T b)
                              = H x − b,
this means that x̂ is the solution to (1) if and only if
H x̂ = b.
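
As a quick numerical sanity check of this equivalence, the following numpy/scipy sketch (with a randomly generated H and b, purely for illustration) minimizes the quadratic in (1) with a generic solver and compares the result to the direct solution of H x = b.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))      # full column rank, so H is positive definite
H = A.T @ A
b = rng.standard_normal(10)

f = lambda x: 0.5 * x @ H @ x - x @ b  # the quadratic objective in (1)
grad = lambda x: H @ x - b             # its gradient

x_min = minimize(f, np.zeros(10), jac=grad).x   # generic convex minimization
x_lin = np.linalg.solve(H, b)                   # solve H x = b directly
print(np.max(np.abs(x_min - x_lin)))            # ~0 up to solver tolerance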

Steepest descent
Say you have an unconstrained optimization program
min_{x∈R^N} f(x),
where f : R^N → R is convex. One simple way to solve this program is to "roll downhill": if we are sitting at a point x_0, then f(·) decreases the fastest if we move in the direction of the negative gradient −∇f(x)|_{x=x_0}.
From a starting point x_0, we move to
x_1 = x_0 − α_0 ∇f(x)|_{x=x_0},
then to
x_2 = x_1 − α_1 ∇f(x)|_{x=x_1},
⋮
x_k = x_{k−1} − α_{k−1} ∇f(x)|_{x=x_{k−1}},
where the α_0, α_1, . . . are appropriately chosen step sizes.
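
A minimal Python sketch of this generic descent iteration follows; the objective and the fixed step-size schedule are made up purely to illustrate the update rule (choosing the step size well for our quadratic is the next topic).

import numpy as np

def gradient_descent(grad_f, x0, step_sizes):
    """Generic descent: x_{k+1} = x_k - alpha_k * grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for alpha in step_sizes:
        x = x - alpha * grad_f(x)
    return x

# Toy example: f(x) = ||x - c||^2 / 2 has gradient x - c and minimizer c.
c = np.array([1.0, -2.0, 3.0])
x_final = gradient_descent(lambda x: x - c, np.zeros(3), [0.5] * 50)
print(x_final)   # approaches c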

[Figure: the method of Steepest Descent on a two-dimensional quadratic, converging to the minimizer along a zigzag path; each gradient is orthogonal to the previous one.]
(from Shewchuk, "... without the agonizing pain")

For our particular optimization problem
min_x (1/2) x^T H x − x^T b,
we can explicitly compute both the gradient and the best choice of step size. The (negative) gradient is what we call the residual, the difference between b and H applied to the current iterate:
−∇_x ( (1/2) x^T H x − x^T b ) |_{x=x_k} = b − H x_k =: r_k.

The steepest descent iteration can be written as
x_{k+1} = x_k + α_k r_k.

There is a nifty way to choose an optimal value for the step size α_k. We want to choose α_k so that f(x_{k+1}) is as small as possible. It is not hard to show that, along the ray x_k + α r_k for α ≥ 0, f(x_k + α r_k) is convex as a function of α. Thus we can choose the value of α that makes the derivative of this function zero; we want
(d/dα) f(x_k + α r_k) = 0.

By the chain rule,
(d/dα) f(x_{k+1}) = ∇f(x_{k+1})^T (d/dα) x_{k+1}
                  = ∇f(x_{k+1})^T r_k.
So we need to choose α_k such that
∇f(x_{k+1}) ⊥ r_k,
or, since ∇f(x_{k+1}) = H x_{k+1} − b = −r_{k+1}, more concisely
r_{k+1} ⊥ r_k    (r_{k+1}^T r_k = 0).
So let’s do this:
r_{k+1}^T r_k = 0
⇒ (b − H x_{k+1})^T r_k = 0
⇒ (b − H(x_k + α_k r_k))^T r_k = 0
⇒ (b − H x_k)^T r_k − α_k r_k^T H r_k = 0
⇒ r_k^T r_k − α_k r_k^T H r_k = 0,
and so the optimal step size is
α_k = (r_k^T r_k) / (r_k^T H r_k).

The steepest descent algorithm performs this iteration until ‖H x_k − b‖_2 is below some tolerance δ:

Steepest Descent, version 1
Initialize: x_0 = some guess, k = 0, r_0 = b − H x_0.
while not converged, ‖r_k‖_2 ≥ δ do
    α_k = r_k^T r_k / (r_k^T H r_k)
    x_{k+1} = x_k + α_k r_k
    r_{k+1} = b − H x_{k+1}
    k = k + 1
end while
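
Here is one possible transcription of version 1 into Python (numpy); the function name and default tolerances are illustrative, and H can be any matrix or operator supporting the @ product. Note the two applications of H in each pass through the loop.

import numpy as np

def steepest_descent_v1(H, b, x0, delta=1e-8, max_iters=10_000):
    """Steepest Descent, version 1: recompute the residual each iteration."""
    x = np.asarray(x0, dtype=float)
    r = b - H @ x
    k = 0
    while np.linalg.norm(r) >= delta and k < max_iters:
        alpha = (r @ r) / (r @ (H @ r))   # first application of H
        x = x + alpha * r
        r = b - H @ x                     # second application of H
        k += 1
    return x, k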

There is a nice trick that can save us one of the two applications of H needed in each iteration above. Notice that
r_{k+1} = b − H x_{k+1} = b − H(x_k + α_k r_k)
        = r_k − α_k H r_k.
So we can save an application of H by updating the residual rather than recomputing it at each stage.

Steepest Descent, more efficient version 2
Initialize: x_0 = some guess, k = 0, r_0 = b − H x_0.
while not converged, ‖r_k‖_2 ≥ δ do
    q = H r_k
    α_k = r_k^T r_k / (r_k^T q)
    x_{k+1} = x_k + α_k r_k
    r_{k+1} = r_k − α_k q
    k = k + 1
end while
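
A matching Python sketch of version 2 follows (again with illustrative names); the single product q = H r_k is reused for both the step size and the residual update, so there is only one application of H per iteration.

import numpy as np

def steepest_descent_v2(H, b, x0, delta=1e-8, max_iters=10_000):
    """Steepest Descent, version 2: one application of H per iteration."""
    x = np.asarray(x0, dtype=float)
    r = b - H @ x
    k = 0
    while np.linalg.norm(r) >= delta and k < max_iters:
        q = H @ r                  # the only matrix-vector product in the loop
        alpha = (r @ r) / (r @ q)
        x = x + alpha * r
        r = r - alpha * q          # residual update instead of recomputation
        k += 1
    return x, k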

The effectiveness of SD depends critically on how H is conditioned and on the starting point. Consider the examples below.

[Figure: convergence of Steepest Descent as a function of the slope of the starting error and the condition number of the matrix; convergence is fast when either quantity is small.]
[Figure: four example runs of Steepest Descent, ranging from rapid convergence to the slow, zigzagging behavior seen when the condition number and the starting slope are both large.]
(from Shewchuk, "... without the agonizing pain")

When the conditioning of H is poor and we choose a bad starting point, convergence can take many iterations even in simple cases.
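
A small numerical experiment, using the steepest_descent_v2 sketch above with randomly generated test matrices whose eigenvalue spreads we control, illustrates this dependence on conditioning.

import numpy as np

def make_H(eigenvalues):
    """Symmetric positive definite test matrix with prescribed eigenvalues."""
    n = len(eigenvalues)
    Q, _ = np.linalg.qr(np.random.randn(n, n))   # random orthonormal basis
    return Q @ np.diag(eigenvalues) @ Q.T

b = np.random.randn(10)
H_good = make_H(np.linspace(1.0, 2.0, 10))    # condition number 2
H_bad = make_H(np.linspace(1.0, 100.0, 10))   # condition number 100

for H in (H_good, H_bad):
    _, iters = steepest_descent_v2(H, b, np.zeros(10), max_iters=100_000)
    print(iters)   # typically far more iterations for the ill-conditioned H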
