This cost can be prohibitive for even moderately large $M$ and $N$. But inverse problems with large $M$ and $N$ are common in the modern world. For example, a typical 3D MRI scan will try to reconstruct a $128 \times 128 \times 128$ cube of voxels from about 5 million non-uniformly spaced samples in the spatial Fourier domain. In this case, the matrix $A$, which models the MRI machine, is $M = 5\cdot 10^6$ by $N = 2.1\cdot 10^6$.
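As a rough back-of-the-envelope check on why forming $A$ explicitly is out of the question:
\[
  N = 128^3 \approx 2.1\cdot 10^{6}, \qquad
  MN \approx (5\cdot 10^{6})(2.1\cdot 10^{6}) \approx 10^{13}\ \text{entries},
\]
which at 8 bytes per entry is on the order of 80 terabytes of storage, whereas the data and the unknowns together only amount to $M + N \approx 7\cdot 10^6$ numbers.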
Iterative methods such as conjugate gradients make the cost of solving a least-squares problem dramatically smaller: about the cost of a few hundred applications of $A$.
In the MRI example above, it takes about one second to apply $A^TA$, and the conjugate gradients method converges in about 50 iterations, meaning that the problem can be solved in less than a minute. Also, the storage requirement is on the order of $O(M + N)$, rather than $O(MN)$.
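To make this concrete, here is a small sketch in the spirit of the MRI example: it solves a (much smaller, regularized) Fourier-sampling least-squares problem with SciPy's conjugate gradients while keeping the matrix implicit. The toy operator, the sizes, and the ridge parameter delta are illustrative choices, not the actual MRI system model.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Toy stand-in for the forward model: sample M coefficients of the DFT
# of a length-N signal. Sizes are tiny compared to the real problem.
N, M = 4096, 10000
rng = np.random.default_rng(0)
rows = rng.integers(0, N, size=M)       # which DFT coefficients are observed

def apply_A(x):
    # y = A x : take the (orthonormal) DFT and keep the observed rows
    return np.fft.fft(x, norm="ortho")[rows]

def apply_AT(y):
    # x = A^H y : scatter the samples back, then apply the adjoint DFT
    z = np.zeros(N, dtype=complex)
    np.add.at(z, rows, y)               # accumulate repeated frequencies
    return np.fft.ifft(z, norm="ortho")

delta = 1e-3                            # small ridge so A^H A + delta*I is positive definite

def normal_matvec(x):
    x = np.ravel(x)                     # LinearOperator may pass (N,1) arrays
    return apply_AT(apply_A(x)) + delta * x

normal_op = LinearOperator((N, N), matvec=normal_matvec, dtype=complex)

x_true = rng.standard_normal(N)
y = apply_A(x_true)                     # simulated measurements
x_hat, info = cg(normal_op, apply_AT(y), maxiter=100)
print(info, np.linalg.norm(apply_A(x_hat) - y) / np.linalg.norm(y))

The point is that only matrix-vector products with $A$ and $A^T$ are ever needed, which is why the storage stays $O(M+N)$.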
Since
\[
  \nabla_x\left(\tfrac{1}{2}x^T H x - x^T b\right)
  = \tfrac{1}{2}\nabla_x(x^T H x) - \nabla_x(x^T b)
  = Hx - b,
\]
this means that $\hat{x}$ is the solution to (1) if and only if
\[
  H\hat{x} = b.
\]
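As a tiny numerical check (a sketch with an arbitrary positive definite $H$, added for illustration), the minimizer of $\tfrac{1}{2}x^THx - x^Tb$ is exactly the solution of $Hx = b$:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
H = B @ B.T + 5 * np.eye(5)          # a symmetric positive definite H
b = rng.standard_normal(5)

x_hat = np.linalg.solve(H, b)        # the candidate minimizer: H x_hat = b
f = lambda x: 0.5 * x @ H @ x - x @ b
print(np.linalg.norm(H @ x_hat - b)) # gradient at x_hat is (numerically) zero

# moving away from x_hat in any direction can only increase f
for _ in range(5):
    d = rng.standard_normal(5)
    assert f(x_hat + 0.1 * d) >= f(x_hat)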
Steepest descent
Say you have an unconstrained optimization program
\[
  \min_{x\in\mathbb{R}^N} f(x).
\]
[Figure from Shewchuk, "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain": steepest descent on a two-dimensional quadratic, run until convergence. Note the zigzag path, which appears because each gradient is orthogonal to the previous gradient.]

For our particular optimization problem
\[
  \min_x\ \tfrac{1}{2}x^T H x - x^T b,
\]
we can explicitly compute both the gradient and the best choice of step size. The gradient is $\nabla f(x) = Hx - b$, so the steepest descent direction at the current iterate $x_k$ is the residual $r_k = b - Hx_k$, and each iteration takes a step $x_{k+1} = x_k + \alpha_k r_k$ for some step size $\alpha_k > 0$.
There is a nifty way to choose an optimal value for the step size $\alpha_k$. We want to choose $\alpha_k$ so that $f(x_{k+1})$ is as small as possible. It is
not hard to show that along the ray $x_k + \alpha r_k$ for $\alpha \ge 0$, $f(x_k + \alpha r_k)$ is convex as a function of $\alpha$. Thus we can choose the value of $\alpha$ that makes the derivative of this function zero; we want
\[
  \frac{d}{d\alpha} f(x_k + \alpha r_k) = 0.
\]
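To see why the convexity claim holds (a short expansion added for completeness, using $r_k = b - Hx_k$ and the symmetry of $H$):
\[
  f(x_k + \alpha r_k)
  = \tfrac{1}{2}(x_k + \alpha r_k)^T H (x_k + \alpha r_k) - (x_k + \alpha r_k)^T b
  = f(x_k) - \alpha\, r_k^T r_k + \tfrac{\alpha^2}{2}\, r_k^T H r_k.
\]
The coefficient of $\alpha^2$ is $\tfrac{1}{2}r_k^T H r_k \ge 0$ when $H$ is positive semidefinite (as it is, for example, when $H = A^TA$), so this is an upward-opening parabola in $\alpha$.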
By the chain rule,
\[
  \frac{d}{d\alpha} f(x_{k+1})
  = \nabla f(x_{k+1})^T \frac{d}{d\alpha} x_{k+1}
  = \nabla f(x_{k+1})^T r_k.
\]
So we need to choose $\alpha_k$ such that
\[
  \nabla f(x_{k+1}) \perp r_k,
\]
or, since $\nabla f(x_{k+1}) = Hx_{k+1} - b = -r_{k+1}$, more concisely
\[
  r_{k+1} \perp r_k \qquad (r_{k+1}^T r_k = 0).
\]
So let's do this:
\[
\begin{aligned}
  r_{k+1}^T r_k &= 0 \\
  \Rightarrow\quad (b - Hx_{k+1})^T r_k &= 0 \\
  \Rightarrow\quad \bigl(b - H(x_k + \alpha_k r_k)\bigr)^T r_k &= 0 \\
  \Rightarrow\quad (b - Hx_k)^T r_k - \alpha_k\, r_k^T H r_k &= 0 \\
  \Rightarrow\quad r_k^T r_k - \alpha_k\, r_k^T H r_k &= 0,
\end{aligned}
\]
and so the optimal step size is
\[
  \alpha_k = \frac{r_k^T r_k}{r_k^T H r_k}.
\]
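As a quick numerical sanity check (a small example added here, not from the notes), one step with this $\alpha_k$ does make the new residual orthogonal to the old one:

import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
H = B @ B.T + np.eye(6)              # symmetric positive definite
b = rng.standard_normal(6)

x = np.zeros(6)
r = b - H @ x                        # residual = negative gradient at x
alpha = (r @ r) / (r @ (H @ r))      # the optimal step size derived above
x_new = x + alpha * r
r_new = b - H @ x_new
print(r_new @ r)                     # ~0: consecutive residuals are orthogonal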
The steepest descent algorithm performs this iteration, $x_{k+1} = x_k + \alpha_k r_k$, until $\|Hx_k - b\|_2$ is below some tolerance $\delta$. Written naively, each iteration requires two matrix-vector products with $H$ (one to compute $r_k = b - Hx_k$ and one to compute $Hr_k$ in the step size); fortunately, one of them can be eliminated:
Steepest Descent, more efficient version

Initialize: $x_0 =$ some guess, $k = 0$, $r_0 = b - Hx_0$.
while not converged ($\|r_k\|_2 \ge \delta$) do
    $q = H r_k$
    $\alpha_k = r_k^T r_k / r_k^T q$
    $x_{k+1} = x_k + \alpha_k r_k$
    $r_{k+1} = r_k - \alpha_k q$
    $k = k + 1$
end while
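In code, this loop might look like the following sketch (a direct translation of the pseudocode; the dense test problem and variable names are illustrative). The key saving is that $q = Hr_k$ is the only matrix-vector product per iteration: since $x_{k+1} = x_k + \alpha_k r_k$, the new residual is $r_{k+1} = b - Hx_{k+1} = r_k - \alpha_k Hr_k = r_k - \alpha_k q$, so it never has to be recomputed from scratch.

import numpy as np

def steepest_descent(H, b, x0, delta=1e-6, max_iter=10000):
    """Minimize 0.5 x^T H x - x^T b for symmetric positive definite H."""
    x = x0.copy()
    r = b - H @ x                    # r_0 = b - H x_0
    k = 0
    while np.linalg.norm(r) >= delta and k < max_iter:
        q = H @ r                    # the single matrix-vector product this iteration
        alpha = (r @ r) / (r @ q)    # alpha_k = r_k^T r_k / r_k^T H r_k
        x = x + alpha * r            # x_{k+1} = x_k + alpha_k r_k
        r = r - alpha * q            # r_{k+1} = r_k - alpha_k q
        k += 1
    return x, k

# small dense test problem
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
H = B @ B.T + np.eye(50)             # symmetric positive definite
b = rng.standard_normal(50)
x_hat, iters = steepest_descent(H, b, np.zeros(50))
print(iters, np.linalg.norm(H @ x_hat - b))   # final residual is below delta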
Figure 17 (from Shewchuk, "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain"): Convergence of Steepest Descent as a function of $\mu$ (the slope of the error) and $\kappa$ (the condition number of the matrix). Convergence is fast when $\mu$ or $\kappa$ is small. For a fixed matrix, convergence is worst when $\mu = \pm\kappa$.
[Four example trajectories, panels (a)-(d), from Shewchuk.]

Figure 18 (from Shewchuk): These four examples represent points near the corresponding four corners of the graph in Figure 17. (a) Large $\kappa$, small $\mu$. (b) An example of poor convergence: $\kappa$ and $\mu$ are both large. (c) Small $\kappa$ and $\mu$. (d) Small $\mu$, large $\kappa$.
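To reproduce the qualitative trend in these figures, here is a small experiment (my own, a self-contained restatement of the steepest descent loop above) comparing iteration counts on $2\times 2$ quadratics with different condition numbers; the counts generally grow with $\kappa$.

import numpy as np

def sd_iterations(H, b, delta=1e-8, max_iter=100000):
    """Number of steepest descent iterations until ||r||_2 < delta."""
    x = np.zeros(len(b))
    r = b - H @ x
    k = 0
    while np.linalg.norm(r) >= delta and k < max_iter:
        q = H @ r
        alpha = (r @ r) / (r @ q)
        x, r, k = x + alpha * r, r - alpha * q, k + 1
    return k

rng = np.random.default_rng(0)
b = rng.standard_normal(2)
for kappa in [1.1, 10.0, 100.0]:
    H = np.diag([1.0, kappa])        # 2x2 system with condition number kappa
    print(kappa, sd_iterations(H, b))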