Fast Solution of ℓ1-Norm Minimization Problems When the Solution May Be Sparse
David L. Donoho
October 2006
Abstract
The minimum ℓ1-norm solution to an underdetermined system of linear equations y = Ax is often, remarkably, also the sparsest solution to that system. This sparsity-seeking property is of interest in signal processing and information transmission. However, general-purpose optimizers are much too slow for ℓ1 minimization in many large-scale applications.
The Homotopy method was originally proposed by Osborne et al. for solving noisy overdetermined ℓ1-penalized least squares problems. We here apply it to solve the noiseless underdetermined ℓ1-minimization problem min ‖x‖_1 subject to y = Ax. We show that Homotopy runs much more rapidly than general-purpose LP solvers when sufficient sparsity is present. Indeed, the method often has the following k-step solution property: if the underlying solution has only k nonzeros, the Homotopy method reaches that solution in only k iterative steps. When this property holds and k is small compared to the problem size, ℓ1 minimization problems with k-sparse solutions can be solved in a fraction of the cost of solving one full-sized linear system.
We demonstrate this k-step solution property for two kinds of problem suites. First, incoherent matrices A, where off-diagonal entries of the Gram matrix A^T A are all smaller than M. If y is a linear combination of at most k ≤ (M^{-1} + 1)/2 columns of A, we show that Homotopy has the k-step solution property. Second, ensembles of d × n random matrices A. If A has iid Gaussian entries, then, when y is a linear combination of at most k < d/(2 log(n)) · (1 − ε_n) columns, with ε_n > 0 small, Homotopy again exhibits the k-step solution property with high probability. Further, we give evidence showing that for ensembles of d × n partial orthogonal matrices, including partial Fourier matrices and partial Hadamard matrices, with high probability, the k-step solution property holds up to a dramatically higher threshold k, satisfying k/d < ρ̂(d/n), for a certain empirically-determined function ρ̂(·).
Our results imply that Homotopy can efficiently solve some very ambitious large-scale problems arising in stylized applications of error-correcting codes, magnetic resonance imaging, and NMR spectroscopy. Our approach also sheds light on the evident parallelism in results on ℓ1 minimization and Orthogonal Matching Pursuit (OMP), and aids in explaining the inherent relations between Homotopy, LARS, OMP, and Polytope Faces Pursuit.
Key Words and Phrases: LASSO. LARS. Homotopy Methods. Basis Pursuit. ℓ1 minimization. Orthogonal Matching Pursuit. Polytope Faces Pursuit. Sparse Representations. Underdetermined Systems of Linear Equations.
Acknowledgements. This work was supported in part by grants from NIH, ONR-MURI,
and NSF DMS 00-77261, DMS 01-40698 (FRG) and DMS 05-05303. We thank Michael Lustig
and Neal Bangerter of the Magnetic Resonance Systems Research Lab (MRSRL) at Stanford
University for supplying us with 3-D MRI data for use in our examples. YT would like to thank
Michael Saunders for helpful discussions and references.
Institute for Computational and Mathematical Engineering, Stanford University, Stanford CA, 94035
1 Introduction
Recently, ℓ1-norm minimization has attracted attention in the signal processing community [9, 26, 20, 7, 5, 6, 8, 27, 31, 32, 44], as an effective technique for solving underdetermined systems of linear equations, of the following form. A measurement vector y ∈ R^d is generated from an unknown signal of interest x_0 ∈ R^n by a linear transformation y = Ax_0, with a known matrix A of size d × n. An important factor in these applications is that d < n, so the system of equations y = Ax is underdetermined. We solve the convex optimization problem
(P1)  min_x ‖x‖_1 subject to y = Ax,
obtaining a vector x_1 which we consider an approximation to x_0.
Applications of (P1) have been proposed in the context of time-frequency representation [9], overcomplete signal representation [20, 26, 27, 31, 32], texture/geometry separation [48, 49], compressed sensing (CS) [14, 7, 56], rapid MR imaging [37, 38], removal of impulsive noise [21], and decoding of error-correcting codes (ECC) [5, 44]. In such applications, the underlying problem is to obtain a solution to y = Ax which is as sparse as possible, in the sense of having few nonzero entries. The above-cited literature shows that often, when the desired solution x_0 is sufficiently sparse, (P1) delivers either x_0 or a reasonable approximation.
Traditionally, the problem of finding sparse solutions has been catalogued as belonging to a class of combinatorial optimization problems, whereas (P1) can be solved by linear programming, and is thus dramatically more tractable in general. Nevertheless, general-purpose LP solvers ultimately involve the solution of full n × n linear systems, and require many such solutions, each costing order O(n^3) flops.
Many of the interesting problems where (P1) has been contemplated are very large in scale. For example, already 10 years ago, Basis Pursuit [9] tackled problems with d = 8,192 and n = 262,144, and problem sizes approached currently [55, 7, 56, 38] are even larger. Such large-scale problems are simply too large for general-purpose strategies to be used in routine processing.
1.1 Greedy Approaches
Many of the applications of (P1) can instead be attacked heuristically by fitting sparse models, using greedy stepwise least squares. This approach is often called Matching Pursuit or Orthogonal Matching Pursuit (Omp) [12] in the signal processing literature. Rather than minimizing an objective function, Omp constructs a sparse solution to a given problem by iteratively building up an approximation; the vector y is approximated as a linear combination of a few columns of A, where the active set of columns to be used is built column by column, in a greedy fashion. At each iteration, a new column is added to the active set: the column that best correlates with the current residual.
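For concreteness, the following is a minimal Python/NumPy sketch of the greedy step just described; the function name omp and the tolerance tol are our own choices and not part of the original Omp formulation.

import numpy as np

def omp(A, y, k, tol=1e-10):
    """Sketch of Orthogonal Matching Pursuit: at each step, add the column of A
    most correlated with the current residual, then re-fit by least squares on
    the active set."""
    n = A.shape[1]
    active = []                                   # indices of selected columns
    x = np.zeros(n)
    residual = y.copy()
    for _ in range(k):
        correlations = A.T @ residual
        j = int(np.argmax(np.abs(correlations)))  # best-matching column
        if j not in active:
            active.append(j)
        # least-squares fit of y on the active columns
        coef, *_ = np.linalg.lstsq(A[:, active], y, rcond=None)
        x = np.zeros(n)
        x[active] = coef
        residual = y - A @ x
        if np.linalg.norm(residual) <= tol:
            break
    return x, active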
Although Omp is a heuristic method, in some cases it works marvelously. In particular, there are examples where the data y admit sparse synthesis using only k columns of A, and greedy selection finds exactly those columns in just k steps. Perhaps surprisingly, optimists can cite theoretical work supporting this notion; work by Tropp et al. [52, 54] and Donoho et al. [19] has shown that Omp can, in certain circumstances, succeed at finding the sparsest solution. Yet, theory provides comfort for pessimists too; Omp fails to find the sparsest solution in certain scenarios where ℓ1 minimization succeeds [9, 15, 52].
Optimists can also rely on empirical evidence to support their hopeful attitude. By now several research groups have conducted studies of Omp on problems where A is random with iid Gaussian elements and x_0 is sparse but random (we are aware of experiments by Tropp, by Petukhov, as well as our own), all with the conclusion that Omp, despite its greedy and heuristic nature, performs roughly as well as (P1) in certain problem settings. Later in this paper we document this in detail. However, to our knowledge, there is no theoretical basis supporting these empirical observations.
This has the air of a mystery: why do (P1) and Omp behave so similarly, when one has a rigorous foundation and the other one apparently does not? Is there some deeper connection or similarity between the two ideas? If there is a similarity, why is Omp so fast while ℓ1 minimization is so slow?
In this paper we aim to shed light on all these questions.
1.2 A Continuation Algorithm To Solve (P1)
In parallel with developments in the signal processing literature, there has also been interest in the statistical community in fitting regression models while imposing ℓ1-norm constraints on the regression coefficients. Tibshirani [51] proposed the so-called Lasso problem, which we state using our notation as follows:
(Lq)  min_x ‖y − Ax‖_2^2 subject to ‖x‖_1 ≤ q;
in words: a least-squares fit subject to an ℓ1-norm constraint on the coefficients. In Tibshirani's original proposal, the d × n matrix A was assumed to have d > n, i.e. representing an overdetermined linear system.
It is convenient to consider instead the unconstrained optimization problem
(Dλ)  min_x ‖y − Ax‖_2^2 / 2 + λ‖x‖_1,
i.e. a form of ℓ1-penalized least-squares. Indeed, problems (Lq) and (Dλ) are equivalent under an appropriate correspondence of parameters: associating with each λ ∈ [0, ∞) a solution x_λ of (Dλ) identifies a solution path, and setting q(λ) = ‖x_λ‖_1 ensures that the solution paths of (Dλ) and (L_{q(λ)}) coincide.
In the classic overdetermined setting, d > n, Osborne, Presnell and Turlach [40, 41] made the useful observation that the solution path of (Dλ), λ ≥ 0 (or (Lq), q ≥ 0) is polygonal. Further, they characterized the changes in the solution x_λ at the vertices of this polygonal path, yielding an algorithm that follows the path vertex by vertex. The algorithm starts at x_λ = 0 for λ large, with an empty active set. At each vertex, the active set is updated through the addition and removal of variables. Thus, in a sequence of steps, it successively obtains the solutions x_λ at the vertices of the path. They called this the Homotopy method, since the objective of (Dλ) is undergoing a homotopy from the ℓ2 constraint to the ℓ1 objective as λ decreases.
Malioutov et al. [39] were the first to apply the Homotopy method to the formulation (Dλ) in the underdetermined setting, when the data are noisy. In this paper, we also suggest using the Homotopy method for solving (P1) in the underdetermined setting.
Homotopy Algorithm for Solving (P1): Apply the Homotopy method: follow the solution path from x_λ = 0, at λ large, toward the λ = 0 limit. Upon reaching the λ = 0 limit, (P1) is solved.
Traditionally, to solve (P1), one would apply the simplex algorithm or an interior-point method, which, in general, starts out with a dense solution and converges to the solution of (P1) through a sequence of iterations, each requiring the solution of a full linear system. In contrast, the Homotopy method starts out at x = 0, and successively builds a sparse solution by adding or removing elements from its active set. Clearly, in a sparse setting, this latter approach is much more favorable, since, as long as the solution has few nonzeros, Homotopy will reach the solution in a few steps.
Numerically, each step of the algorithm involves a rank-one update of a linear system, and so if the whole procedure stops in k steps, yielding a solution with k nonzeros, its overall complexity is bounded by k^3 + kdn flops. For k ≪ d and d ≪ n, this is far better than the d^3/3 flops it would take to solve just one d × d linear system. Moreover, to solve (Dλ) for all λ ≥ 0 by a traditional approach, one would need to repeatedly solve a quadratic program for every value of λ of interest. For any problem size beyond the very smallest, that would be prohibitively time consuming. In contrast, Homotopy delivers all the solutions x_λ to (Dλ), λ ≥ 0.
1.3 LARS
Efron, Hastie, Johnstone, and Tibshirani [24] developed an approximation to the Homotopy
algorithm which is quite instructive. The Homotopy algorithm maintains an active set of
nonzero variables composing the current solution. When moving to a new vertex of the solution
polygon, the algorithm may either add new elements to or remove existing elements from the
active set. The Lars procedure is obtained by following the same sequence of steps, only
omitting the step that considers removal of variables from the active set, thus constraining its
behavior to adding new elements to the current approximation. In other words, once activated,
a variable is never removed.
In modifying the stepwise rule, of course, one implicitly obtains a new polygonal path, the Lars path, which in general may be different from the Lasso path. Yet, Efron et al. observed that in practice, the Lars path is often identical to the Lasso path. This equality is very interesting in the present context, because Lars is so similar to Omp. Both algorithms build up a model a step at a time, adding a new variable to the active set at each step, and ensuring that the new variable is in some sense the most important among the potential candidate variables. The details of determining importance differ, but in both cases involve the inner product between the candidate new variables and the current residual.
In short, a stepwise algorithm with a greedy flavor can sometimes produce the same result as full-blown ℓ1 minimization. This suggests a possibility which can be stated in two different ways:
... that an algorithm for quasi-ℓ1 minimization runs just as rapidly as Omp.
... that an algorithm visibly very similar to Omp can be just as effective as ℓ1 minimization.
1.4 Properties of Homotopy
Another possibility comes to mind. The Homotopy algorithm is, algorithmically speaking, a variant of Lars, differing only in a few lines of code. And yet this small variation makes Homotopy rigorously able to solve a global optimization problem.
More explicitly, the difference between Homotopy and Lars lies in the provision by Homotopy for terms to leave, as well as enter, the active set. This means that the number of steps required by Homotopy can, in principle, be significantly greater than the number of steps required by Lars, as terms enter and leave the active set numerous times. If so, we observe model churning which causes Homotopy to run slowly. We present evidence that under favorable conditions, such churning does not occur. In such cases, the Homotopy algorithm is roughly as fast as both Lars and Omp.
In this paper, we consider two settings for performance measurement: deterministic incoherent matrices and random matrices. We demonstrate that, when a k-sparse representation exists, k ≪ d, the Homotopy algorithm finds it in k steps. Results in each setting parallel existing results about Omp in that setting. Moreover, the discussion above indicates that each step of Homotopy is identical to a step of Lars and therefore very similar to a corresponding step of Omp. We interpret this parallelism as follows.
The Homotopy algorithm, in addition to solving the ℓ1 minimization problem, runs just as rapidly as the heuristic algorithms Omp/Lars, in those problem suites where those algorithms have been shown to correctly solve the sparse representation problem.
Since there exists a wide range of cases where ℓ1 minimization is known to find the sparsest solution while Omp is known to fail [9, 15, 52], Homotopy gives in some sense the best of both worlds by a single algorithm: speed where possible, sparsity where possible.
In addition, our viewpoint sheds new perspective on previously-observed similarities between results about ℓ1 minimization and about Omp. Previous theoretical work [19, 52, 54] found that these two seemingly disparate methods had comparable theoretical results for finding sparse solutions. Our work exhibits inherent similarities between the two methods which motivate these parallel findings.
Figure 1 summarizes the relation between ℓ1 minimization and Omp, by highlighting the intimate bonds between Homotopy, Lars, and Omp. We elaborate on these inherent connections in a later section.
1.5 Main Results
Our results about the stopping behavior of Homotopy require some terminology.
Definition 1 A problem suite S(E, V; d, n, k) is a collection of problems defined by three components: (a) an ensemble E of matrices A of size d × n; (b) an ensemble V of n-vectors α_0 with at most k nonzeros; (c) an induced ensemble of left-hand sides y = Aα_0.
Where the ensemble V is omitted from the definition of a problem suite, we interpret that as saying that the relevant statements hold for any nonzero distribution. Otherwise, V will take one of three values: (Uniform), when the nonzeros are iid uniform on [0, 1]; (Gauss), when the nonzeros are iid standard normal; and (Bernoulli), when the nonzeros are iid equi-probable Bernoulli distributed with values ±1.
Figure 1: Bridging ℓ1 minimization and OMP. (1) Homotopy provably solves ℓ1 minimization problems [24]. (2) Lars is obtained from Homotopy by removing the sign constraint check. (3) Omp and Lars are similar in structure, the only difference being that Omp solves a least-squares problem at each iteration, whereas Lars solves a linearly-penalized least-squares problem. There are initially surprising similarities of results concerning the ability of OMP and ℓ1 minimization to find sparse solutions. We view those results as statements about the k-step solution property. We present evidence showing that for sufficiently small k, steps (2) and (3) do not affect the threshold for the k-step solution property. Since both algorithms have the k-step solution property under the given conditions, they both have the sparse solution property under those conditions.
Definition 2 An algorithm 𝒜 has the k-step solution property at a given problem instance (A, y) drawn from a problem suite S(E, V; d, n, k) if, when applied to the data (A, y), the algorithm terminates after at most k steps with the correct solution.
Our main results establish the k-step solution property in specific problem suites.
1.5.1 Incoherent Systems
The mutual coherence M(A) of a matrix A whose columns are normalized to unit length is the maximal off-diagonal entry of the Gram matrix A^T A. We call the collection of matrices A with M(A) ≤ μ the incoherent ensemble with coherence bound μ (denoted Inc_μ). Let S(Inc_μ; d, n, k) be the suite of problems with d × n matrices drawn from the incoherent ensemble Inc_μ, with vectors α_0 having ‖α_0‖_0 ≤ k. For the incoherent problem suite, we have the following result:
Theorem 1 Let (A, y) be a problem instance drawn from S(Inc_μ; d, n, k), with
k ≤ (μ^{-1} + 1)/2.   (1.1)
Then, the Homotopy algorithm runs k steps and stops, delivering the solution α_0.
Figure 2: Performance of Homotopy in decoding a rate-1/5 partial Hadamard code. Each panel shows the number of iterations, divided by n, that Homotopy takes in order to recover the coded signal, versus ε, the fraction of corrupt entries in the transmitted signal, with the noise distributed (a) N(0,1); (b) ±1 with equal probabilities; (c) Cauchy; (d) Rayleigh. In each plot, 'o' indicates k-step recovery, and 'x' indicates recovery in more than k steps.
The encoding matrix E is composed of p rows of H, chosen at random; the decoding matrix D is then composed of the other 4p rows of H. The encoding stage amounts to computing x = E^T θ, with x the encoded signal, of length n. At the decoder, we receive r = x + z, where z has nonzeros in k random positions, with the nonzeros following a specified distribution. At the decoder, we apply Homotopy to solve
(EC1)  min_α ‖α‖_1 subject to Dα = Dr;
call the solution α̂. The decoder output is then θ̂ = sgn(E(r − α̂)).
The key property being exploited is the mutual orthogonality of D and E. Specifically, note that Dr = D(E^T θ + z) = Dz. Hence, (EC1) is essentially solving for the sparse error pattern.
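To illustrate the orthogonality property the decoder relies on, the following sketch builds E and D from the rows of a random orthogonal matrix (standing in for the Hadamard matrix H of the experiment, which is an assumption of convenience) and checks that Dr depends only on the error pattern; all variable names are ours.

import numpy as np

rng = np.random.default_rng(0)
p = 64
n = 5 * p                                   # rate-1/5 code
# random orthogonal matrix standing in for the Hadamard matrix H
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
E, D = Q[:p, :], Q[p:, :]                   # p encoding rows, 4p decoding rows

theta = rng.choice([-1.0, 1.0], size=p)     # information bits
x = E.T @ theta                             # encoded signal of length n
z = np.zeros(n)                             # sparse corruption
support = rng.choice(n, size=10, replace=False)
z[support] = rng.standard_normal(10)
r = x + z                                   # received signal

# D annihilates the code: D r depends only on the error pattern z
print(np.allclose(D @ E.T, 0.0))            # True (up to rounding)
print(np.allclose(D @ r, D @ z))            # True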
To demonstrate our claim, we set p = 256, and considered error rates k = εn, with ε varying in [0.04, 0.4]. In each case we generated a problem instance as described above, and measured the number of steps Homotopy takes to reach the correct solution. Results are plotted in Figure 2, showing four different noise distributions. Inspecting the results, we make two important observations. First, as claimed, for the noise distributions considered, as long as the number of corrupt entries is less than 0.24n, Homotopy recovers the correct solution in k steps, as implied by Empirical Finding 2. Second, as ε increases, the k-step property may begin to fail, but Homotopy still does not take many more than k steps, in accord with Empirical Finding 3.
1.7 Relation to Earlier Work
Our results seem interesting in light of previous work by Osborne et al. [40] and Efron et al. [24]. In particular, our proposal for fast solution of ℓ1 minimization amounts simply to applying to (P1) previously known algorithms (Homotopy, a.k.a. Lars/Lasso) designed for fitting equations to noisy data. We make the following contributions.
Few Steps. Efron et al. considered the overdetermined noisy setting, and remarked in passing that while, in principle, Homotopy could take many steps, they had encountered examples where it took a direct path to the solution, in which a term, once entered into the active set, stayed in the active set [24]. Our work in the noiseless underdetermined setting formally identifies a precise phenomenon, namely, the k-step solution property, and delineates a range of problem suites where this phenomenon occurs.
Similarity of Homotopy and Lars. Efron et al., in the overdetermined noisy setting, commented that, while in principle the solution paths of Homotopy and Lars could be different, in practice they came across examples where Homotopy and Lars yielded very similar results [24]. Our work in the noiseless case formally defines a property which implies that Homotopy and Lars have the same solution path, and delineates a range of problem suites where this property holds. In addition, we present simulation studies showing that, over a region in parameter space where ℓ1 minimization recovers the sparsest solution, Lars also recovers the sparsest solution.
Similarity of Homotopy and Omp. In the random setting, a result of Tropp and Gilbert [54] can be read as saying that Omp recovers the sparsest solution under a condition similar to (1.2), although with a somewhat worse constant term; their result (and proof) can be interpreted in the light of our work as saying that Omp actually has the k-step solution property under their condition. In fact, it has the k-step solution property under the condition (1.2). Below, we elaborate on the connections between Homotopy and Omp leading to these similar results.
Similarity of Homotopy and Polytope Faces Pursuit. Recently, Plumbley introduced a greedy algorithm to solve (P1) in the dual space, which he named Polytope Faces Pursuit (Pfp) [43, 42]. The algorithm bears strong resemblance to Homotopy and Omp. Below, we elaborate on Plumbley's work, to show that the affinities are not coincidental, and in fact, under certain conditions, Pfp is equivalent to Homotopy. We then conclude that Pfp maintains the k-step solution property under the same conditions required for Homotopy.
In short, we provide theoretical underpinnings, formal structure, and empirical findings. We also believe our viewpoint clarifies the connections between Homotopy, Lars, and Pfp.
1.8 Contents
The paper is organized as follows. Section 2 reviews the Homotopy algorithm in detail. Section 3 presents running-time simulations alongside a formal complexity analysis of the algorithm, and discusses evidence leading to Empirical Finding 3. In Section 4 we prove Theorem 1, the k-step solution property for the sparse incoherent problem suite. In Section 5 we demonstrate Empirical Finding 1, the k-step solution property for the sparse USE problem suite. Section 6 follows, with evidence to support Empirical Finding 2. In Section 7 we elaborate on the relation between ℓ1 minimization and Omp hinted at in Figure 1. This provides a natural segue to Section 8, which discusses the connections between Homotopy and Pfp. Section 9 then offers a comparison of the performance of Homotopy, Lars, Omp and Pfp in recovering sparse solutions. In Section 10, we demonstrate the applicability of Homotopy and Lars with examples inspired by NMR spectroscopy and MR imaging. Section 11 discusses the software accompanying this paper. Section 12 briefly discusses approximate methods for recovery of sparse solutions, and a final section has concluding remarks.
2 The Homotopy Algorithm
We shall start our exposition with a description of the Homotopy algorithm. Recall that the general principle undergirding homotopy methods is to trace a solution path, parametrized by one or more variables, while evolving the parameter vector from an initial value, for which the corresponding solution is known, to the desired value. For the Lasso problem (Dλ), this implies following the solution x_λ as λ decreases from a large value, where x_λ = 0, toward 0, where x_λ coincides with the solution of (P1). Write f_λ(x) for the objective function of (Dλ). A necessary and sufficient condition for x_λ to be a minimizer of f_λ(x) is that 0 ∈ ∂_x f_λ(x_λ), i.e. the zero vector is an element of the subdifferential of f_λ at x_λ. We calculate
∂_x f_λ(x_λ) = −A^T (y − A x_λ) + λ ∂‖x_λ‖_1,   (2.3)
where ∂‖x_λ‖_1 is the subgradient
∂‖x_λ‖_1 = { u ∈ R^n : u_i = sgn(x_{λ,i}) if x_{λ,i} ≠ 0; u_i ∈ [−1, 1] if x_{λ,i} = 0 }.
Let I = {i : x_λ(i) ≠ 0} denote the support of x_λ, and call c_λ = A^T (y − A x_λ) the vector of residual correlations. The condition 0 ∈ ∂_x f_λ(x_λ) is then equivalent to the requirements
c_λ(I) = λ · sgn(x_λ(I)),   (2.4)
and
|c_λ(I^c)| ≤ λ.   (2.5)
In words, residual correlations on the support I must all have magnitude equal to λ, and signs that match the corresponding elements of x_λ, while residual correlations off the support must have magnitude at most λ. The Homotopy algorithm follows the solution path by maintaining, at the ℓ-th stage, an active set
I = { j : |c_ℓ(j)| = ‖c_ℓ‖_∞ = λ },   (2.6)
as implied by conditions (2.4) and (2.5). At the ℓ-th stage, Homotopy first computes an update direction d_ℓ, by solving
A_I^T A_I d_ℓ(I) = sgn(c_ℓ(I)),   (2.7)
with d_ℓ set to zero in coordinates not in I. This update direction ensures that the magnitudes of residual correlations on the active set all decline equally. The algorithm then computes the step size to the next breakpoint along the homotopy path. Two scenarios may lead to a breakpoint. First, that a non-active element of c_ℓ would increase in magnitude beyond that of the elements in the active set, violating (2.5). This first occurs when
γ_ℓ^+ = min⁺_{i ∉ I} { (λ − c_ℓ(i)) / (1 − a_i^T v_ℓ), (λ + c_ℓ(i)) / (1 + a_i^T v_ℓ) },   (2.8)
where v_ℓ = A_I d_ℓ(I), and the minimum is taken only over positive arguments. Call the minimizing index i^+. The second scenario leading to a breakpoint in the path occurs when an active coordinate crosses zero, violating the sign agreement in (2.4). This first occurs when
γ_ℓ^− = min⁺_{i ∈ I} { −x_ℓ(i) / d_ℓ(i) },   (2.9)
where again the minimum is taken only over positive arguments. Call the minimizing index i^−.
Homotopy then marches to the next breakpoint, determined by
γ_ℓ = min{ γ_ℓ^+, γ_ℓ^− },   (2.10)
updates the active set, by either appending I with i^+ or removing i^−, and computes a new solution estimate x_ℓ = x_{ℓ−1} + γ_ℓ d_ℓ. The algorithm terminates when ‖c_ℓ‖_∞ = 0, i.e. when the residual correlations vanish and the solution of (P1) has been reached.
A few remarks about the algorithm are in order. The Lars procedure is obtained by omitting the second scenario (2.9), i.e. by never considering the removal of elements from the active set. The resulting scheme adds a single element to the active set at each iteration, never removing active elements from the set.
In addition, the Homotopy algorithm may be easily adapted to deal with noisy data. Assume that rather than observing y_0 = Aα_0, we observe a noisy version y = Aα_0 + z, with ‖z‖_2 ≤ ε_n. Donoho et al. [19, 16] have shown that, for certain matrix ensembles, the solution x_q of (Lq), for an appropriate value of q depending on the noise level ε_n, has an error which is at worst proportional to the noise level. To solve for such an x_q, we simply apply the Homotopy algorithm as described above, terminating as soon as the residual satisfies ‖r_ℓ‖_2 ≤ ε_n. Since in most practical scenarios it is not sensible to assume that the measurements are perfectly noiseless, this attribute of Homotopy makes it an apt choice for use in practice.
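To make the update rules (2.7)-(2.10) concrete, here is a minimal Python/NumPy sketch of the iteration just described. It is only an illustration under simplifying assumptions: it recomputes and solves the Gram minor from scratch at each step (rather than maintaining a Cholesky factor, as discussed in Section 3), omits the noisy-data stopping rule, and the function name, tolerance and iteration cap are our own choices.

import numpy as np

def homotopy(A, y, tol=1e-10, max_iter=None):
    """Sketch of the Homotopy path-following iteration for (P1)."""
    d, n = A.shape
    max_iter = max_iter or 10 * d
    x = np.zeros(n)
    c = A.T @ y                                  # residual correlations
    lam = np.max(np.abs(c))                      # current lambda
    active = [int(np.argmax(np.abs(c)))]         # active set I
    for _ in range(max_iter):
        I = np.array(active)
        # update direction (2.7): A_I^T A_I d_I = sgn(c_I)
        d_I = np.linalg.solve(A[:, I].T @ A[:, I], np.sign(c[I]))
        v = A[:, I] @ d_I
        # (2.8): first inactive correlation to reach the common magnitude
        gamma_plus, i_plus = np.inf, None
        for i in range(n):
            if i in active:
                continue
            a_v = A[:, i] @ v
            for num, den in ((lam - c[i], 1.0 - a_v), (lam + c[i], 1.0 + a_v)):
                if den > 1e-12:
                    g = num / den
                    if 1e-12 < g < gamma_plus:
                        gamma_plus, i_plus = g, i
        # (2.9): first active coefficient to cross zero
        gamma_minus, i_minus = np.inf, None
        for idx, i in enumerate(active):
            g = -x[i] / d_I[idx] if d_I[idx] != 0 else np.inf
            if 1e-12 < g < gamma_minus:
                gamma_minus, i_minus = g, i
        # (2.10), capped at lam so the final step lands on the (P1) solution
        gamma = min(gamma_plus, gamma_minus, lam)
        x[I] += gamma * d_I
        lam -= gamma
        c = A.T @ (y - A @ x)
        if lam <= tol or np.linalg.norm(y - A @ x) <= tol:
            break
        if gamma_minus < gamma_plus:
            active.remove(i_minus)               # sign violation: drop a term
        else:
            active.append(i_plus)                # new term enters
    return x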
3 Computational Cost
In earlier sections, we claimed that Homotopy can solve the problem (P1) much faster than general-purpose LP solvers, which are traditionally used to solve it. In particular, when the k-step solution property holds, Homotopy runs in a fraction of the time it takes to solve one full linear system, making it as efficient as fast stepwise algorithms such as Omp. To support these assertions, we now present the results of running-time simulations, complemented by a formal analysis of the asymptotic complexity of Homotopy.
3.1 Running Times
To evaluate the performance of the Homotopy algorithm applied to (P1), we compared its running times on instances of the problem suite S(USE, Gauss; d, n, k) with two state-of-the-art algorithms for solving linear programs. The first, LP Solve, is a Mixed Integer Linear Programming solver, implementing a variant of the simplex algorithm [2]. The other, PDCO, a Primal-Dual Convex Optimization solver, is a log-barrier interior-point method, written by Michael Saunders of the Stanford Optimization Laboratory [45]. Table 1 shows the running times for Homotopy, LP Solve and PDCO, for various problem dimensions. The figures appearing in the table were measured on a 3GHz Xeon workstation.
(d,n,k) Homotopy LP Solve PDCO
(200,500,20) 0.03 2.04 0.90
(250,500,100) 0.94 9.27 1.47
(500,1000,100) 1.28 45.18 4.62
(750,1000,150) 2.15 85.78 7.68
(500,2000,50) 0.34 59.52 6.86
(1000,2000,200) 7.32 407.99 22.90
(1600,4000,320) 31.70 2661.42 122.47
Table 1: Comparison of execution times (in seconds) of Homotopy, LP Solve and PDCO, applied to instances of the random problem suite S(USE, Gauss; d, n, k).
Examination of the running times in the table suggests two important observations. First, when the underlying solution to the problem (P1) is sparse, a tremendous saving in computation time is achieved using Homotopy, compared to traditional LP solvers. For instance, when A is 500 × 2000, and y admits sparse synthesis with k = 50 nonzeros, Homotopy terminates in about 0.34 seconds, 20 times faster than PDCO, and over 150 times faster than LP Solve. Second, when the linear system is highly underdetermined (i.e. d/n is small), even when the solution is not sparse, Homotopy is more efficient than either LP Solve or PDCO. This latter observation is of particular importance for applications, as it implies that Homotopy may be safely used to solve ℓ1 minimization problems even when the underlying solution is not known to be sparse.
3.2 Complexity Analysis
The timing studies above are complemented by a detailed analysis of the complexity of the Homotopy algorithm. We begin by noting that the bulk of computation is invested in the solution of the linear system (2.7) at each iteration. Thus, the key to an efficient implementation is maintaining a Cholesky factorization of the Gram minor A_I^T A_I, updating it with the addition/removal of elements to/from the active set. This allows for fast solution for the update direction; O(d^2) operations are needed to solve (2.7), rather than the O(d^3) flops it would ordinarily take. In detail, let ℓ = |I|, where I denotes the current active set. The dominant calculations per iteration are the solution of (2.7) at a cost of 2ℓ^2 flops, and computation of the step to the next vertex on the solution path, using nd + 6(n − ℓ) flops. In addition, the Cholesky factor of A_I^T A_I is either updated by appending a column to A_I, at a cost of ℓ^2 + ℓd + 2(ℓ + d) flops, or downdated by removing a column of A_I, using at most 3ℓ^2 flops.
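As an illustration of the bookkeeping just described, the sketch below appends one column to an existing Cholesky factorization of A_I^T A_I without refactorizing from scratch; the function name and interface are ours, and SciPy is assumed to be available.

import numpy as np
from scipy.linalg import solve_triangular

def cholesky_append(R, A_I, a_new):
    """Given the upper-triangular Cholesky factor R of A_I^T A_I, return the
    factor of the Gram minor after the column a_new joins the active set."""
    if R.size == 0:
        return np.array([[np.linalg.norm(a_new)]])
    w = solve_triangular(R, A_I.T @ a_new, trans='T')   # R^T w = A_I^T a_new
    diag = np.sqrt(a_new @ a_new - w @ w)               # new diagonal entry
    k = R.shape[0]
    R_new = np.zeros((k + 1, k + 1))
    R_new[:k, :k] = R
    R_new[:k, k] = w
    R_new[k, k] = diag
    return R_new

With the factor maintained in this way, solving (2.7) reduces to two triangular solves per iteration.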
To conclude, without any sparsity constraints on the data, k Homotopy steps would take at most 4kd^2/3 + kdn + O(kn) flops. However, if the conditions for the k-step solution property are satisfied, a more favorable estimate holds.
Lemma 1 Let (A, y) be a problem instance drawn from a suite S(E, V; d, n, k). Suppose the k-step solution property holds, and the Homotopy algorithm performs k steps, each time adding a single element into the active set. The algorithm terminates in k^3 + kdn + (1/2)k^2(d − 1) + n(8k + d + 1) + (1/2)k(5d − 3) flops.
Notice that, for k ≪ d, the dominant term in the expression for the operation count is kdn, the number of flops needed to carry out k matrix-vector multiplications. Thus, we may interpret Lemma 1 as saying that, under such favorable conditions, the computational workload of Homotopy is roughly proportional to k applications of a d × n matrix. For comparison, applying least-squares to solve the underdetermined linear system Ax = y would require 2d^2 n − 2d^3/3 operations [30]. Thus, for k ≪ d, Homotopy runs in a fraction of the time it takes to solve one least-squares system.
To visualize this statement, panel (a) of Figure 3 displays the operation count of Homotopy on a grid with varying sparsity and indeterminacy factors. In this simulation, we measured the total operation count of Homotopy as a fraction of the cost of solving one d × n least-squares system, for random instances of the problem suite S(USE, Uniform; d, n, k). We fixed n = 1000, varied the indeterminacy of the system, indexed by δ = d/n, in the range [0.1, 1], and the sparsity of the solution, indexed by ρ = k/d, in the range [0.05, 1]. For reference, we superposed the theoretical bound ρ_W below which the solution of (P1) is, with high probability, the sparsest solution (see Section 9 for more details). Close inspection of this plot reveals that, below this curve, Homotopy delivers the solution to (P1) rapidly, much faster than it takes to solve one d × n least-squares problem.
3.3 Number of Iterations
Considering the analysis just presented, it is clear that the computational efficiency of Homotopy greatly depends on the number of vertices on the polygonal Homotopy path. Indeed, since it allows removal of elements from the active set, Homotopy may, conceivably, require an arbitrarily large number of iterations to reach a solution. We note that this property is not shared by Lars or Omp; owing to the fact that these algorithms never remove elements from the active set, after d iterations they terminate with zero residual.
In [24], Efron et al. briefly noted that they had observed examples where Homotopy does not drop elements from the active set very frequently, and so, overall, it does not require many more iterations than Lars to reach a solution. We explore this initial observation further, and present evidence leading to Empirical Finding 3. Specifically, consider the problem suite S(E, V; d, n, k), with E ∈ {USE, PFE, PHE, URPE} and V ∈ {Uniform, Gauss, Bernoulli}. The space (d, n, k) of dimensions of underdetermined problems may be divided into three regions:
Figure 3: Computational Cost of Homotopy. Panel (a) shows the operation count as a fraction of one least-squares solution on a δ-ρ grid, with n = 1000. Panel (b) shows the number of iterations, as a fraction of d, on the same grid. The superimposed dashed curve depicts the curve ρ_W, below which Homotopy recovers the sparsest solution with high probability.
1. k-step region. When (d, n, k) are such that the k-step property holds, then Homotopy successfully recovers the sparse solution in k steps, as suggested by Empirical Findings 1 and 2. Otherwise;
2. k-sparse region. When (d, n, k) are such that ℓ1 minimization correctly recovers k-sparse solutions but Homotopy takes more than k steps, then, with high probability, it takes no more than c_s · d steps, with c_s a constant depending on E, V, empirically found to be 1.6. Otherwise;
3. Remainder. With high probability, Homotopy does not recover the sparsest solution, returning a solution to (P1) in c_f · d steps, with c_f a constant depending on E, V, empirically found to be 4.85.
We note that the so-called constants c_s, c_f depend weakly on δ and n.
A graphical depiction of this division of the space of admissible (d, n, k) is given in panel (b) of Figure 3. It shows the number of iterations Homotopy performed for various (d, n, k) configurations, as a shaded attribute on a grid indexed by δ and ρ. Inspection of this plot reveals that Homotopy performs at most 1.6·d iterations, regardless of the underlying solution sparsity. In particular, in the region below the curve ρ_W, where, with high probability, Homotopy recovers the sparsest solution, it does so in less than d steps.
More extensive evidence is given in Table 2, summarizing the results of a comprehensive study. We considered four matrix ensembles E, each coupled with three nonzero ensembles V. For a problem instance drawn from S(E, V; d, n, k), we recorded the number of iterations required to reach a solution. We repeated this at many different (d, n, k) configurations, generating 100 independent realizations for each (d, n, k), and computing the average number of iterations observed at each instance. Table 2 displays the estimated constants c_s, c_f for different combinations of matrix ensemble and coefficient ensemble. Thus, the results in Table 2 read, e.g., "Applied to a problem instance drawn from S(USE, Uniform; d, n, k), Homotopy takes, with high probability, no more than 1.69·d iterations to obtain the minimum ℓ1 solution."
Coeff. Ensemble:       Uniform          Gauss           Bernoulli
Matrix Ensemble        c_s     c_f      c_s     c_f     c_s     c_f
USE                    1.65    1.69     1.6     1.7     1.23    1.68
PFE                    0.99    3.44     0.99    1.42    0.86    1.49
PHE                    0.99    4.85     0.92    1.44    0.87    1.46
URPE                   1.05    4.4      1       1.46    1       1.44
Table 2: Maximum number of Homotopy iterations, as a fraction of d, for various matrix / coefficient ensembles.
4 Fast Solution with the Incoherent Ensemble
Let the d × n matrix A have unit-length columns a_j, ‖a_j‖_2 = 1. The mutual coherence
M(A) = max_{i ≠ j} |⟨a_i, a_j⟩|
measures the smallest angle between any pair of columns. As n > d, this angle must be greater than zero: the columns cannot be mutually orthogonal; in fact, there is an established lower bound on the coherence of A:
M(A) ≥ sqrt( (n − d) / (d(n − 1)) ).
Matrices which satisfy this lower bound with equality are known as optimal Grassmannian frames [50]. There is by now much work establishing relations between the sparsity of a coefficient vector α_0 and the coherence of a matrix A needed for successful recovery of α_0 via ℓ1 minimization [20, 18, 53, 31] or Omp [19, 52]. In detail, for a general matrix A with coherence M(A), both ℓ1 minimization and OMP recover the solution α_0 from data y = Aα_0 whenever the number of nonzeros in α_0 satisfies
‖α_0‖_0 < (M(A)^{-1} + 1)/2;   (4.1)
see, for example, Theorem 7 of [18] or Corollary 3.6 of [52]. Comparing (4.1) with (1.1), we see that Theorem 1 essentially states that a parallel result holds for the Homotopy algorithm.
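As a small numerical illustration of these definitions, the sketch below computes M(A) for a column-normalized matrix and the corresponding bound (4.1); the function name and the example dimensions are our own choices.

import numpy as np

def mutual_coherence(A):
    """Maximal off-diagonal entry (in magnitude) of the Gram matrix of the
    column-normalized matrix A."""
    A = A / np.linalg.norm(A, axis=0)      # normalize columns to unit length
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 500))
M = mutual_coherence(A)
print(M, (1.0 / M + 1.0) / 2.0)            # coherence and the bound (4.1)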
Before proceeding to the proof of the Theorem, we make a few introductory assumptions. As in the above, we assume that the Homotopy algorithm operates with a problem instance (A, y) as input, with y = Aα_0 and ‖α_0‖_0 = k. To simplify notation, we assume, without loss of generality, that α_0 has its nonzeros in the first k positions. Further, we operate under the convention that at each step of the algorithm, only one vector is introduced into the active set. If two or more vectors are candidates to enter the active set, assume the algorithm inserts them one at a time, on separate stages. Finally, to fix notation, let x_ℓ denote the solution at the ℓ-th step, r_ℓ = y − Ax_ℓ the corresponding residual, and c_ℓ = A^T r_ℓ the corresponding residual correlation vector. To prove Theorem 1, we introduce two useful notions.
Definition 3 We say that the Homotopy algorithm has the Correct Term Selection Property at a given problem instance (A, y), with y = Aα_0, if at each iteration, the algorithm selects a new term to enter the active set from the support set of α_0.
Homotopy has the correct term selection property if, throughout its operation, it builds the solution using only correct terms. Thus, at termination, the support set of the solution is guaranteed to be a subset of the support set of α_0.
Definition 4 We say that the Homotopy algorithm has the Sign Agreement Property at a given problem instance (A, y), with y = Aα_0, if at every step ℓ, for all j ∈ I,
sgn(x_ℓ(j)) = sgn(c_ℓ(j)).
In words, Homotopy has the sign agreement property if, at every step of the algorithm, the residual correlations in the active set agree in sign with the corresponding solution coefficients. This ensures that the algorithm never removes elements from the active set.
These two properties are the fundamental building blocks needed to ensure that the k-step solution property holds. In particular, these two properties are necessary and sufficient conditions for successful k-step termination, as the following lemma shows.
Lemma 2 Let (A, y) be a problem instance drawn from a suite S(E, V; d, n, k). The Homotopy algorithm, when applied to (A, y), has the k-step solution property if and only if it has the correct term selection property and the sign agreement property.
Proof. To argue in the forward direction, we note that, after k steps, correct term selection implies that the active set is a subset of the support of α_0, i.e. I ⊆ {1, ..., k}. In addition, the sign agreement property ensures that no variables leave the active set. Therefore, after k steps, I = {1, ..., k} and the Homotopy algorithm recovers the correct sparsity pattern. To show that after k steps the algorithm terminates, we note that, at the k-th step, the step-size γ_k is chosen so that, for some j ∈ I^c,
|c_k(j) − γ_k a_j^T A_I d_k(I)| = λ_k − γ_k,   (4.2)
with λ_k = ‖c_k(I)‖_∞.
4.1 Correct Term Selection
Lemma 3 Suppose that y = Aα_0, where α_0 has at most k nonzeros, all in the first k positions, with
k ≤ (M(A)^{-1} + 1)/2,   (4.3)
and suppose that, at the ℓ-th step, the residual can be written as
r_ℓ = Σ_{j=1}^{k} α_ℓ(j) a_j,
a linear combination of the first k columns of A. Then the next step of the Homotopy algorithm selects an index from among the first k.
Proof. We will show that at the ℓ-th step,
max_{1 ≤ i ≤ k} |⟨r_ℓ, a_i⟩| > max_{i > k} |⟨r_ℓ, a_i⟩|,   (4.4)
and so at the end of the ℓ-th iteration, the active set is a subset of {1, ..., k}.
Let G = A^T A denote the Gram matrix of A, and let i* = arg max_{1 ≤ i ≤ k} |α_ℓ(i)|. The left hand side of (4.4) is bounded below by
max_{1 ≤ i ≤ k} |⟨r_ℓ, a_i⟩| ≥ |⟨r_ℓ, a_{i*}⟩|
  = | Σ_{j=1}^{k} α_ℓ(j) G_{i*,j} |
  ≥ |α_ℓ(i*)| − Σ_{j ≠ i*} |G_{i*,j}| |α_ℓ(j)|
  ≥ |α_ℓ(i*)| − M(A) Σ_{j ≠ i*} |α_ℓ(j)|
  ≥ (1 − (k − 1) M(A)) |α_ℓ(i*)|.   (4.5)
Here we used: ‖a_j‖_2^2 = 1 for all j, and |G_{ij}| ≤ M(A) for j ≠ i. As for the right-hand side of (4.4), for i > k we have
|⟨r_ℓ, a_i⟩| ≤ Σ_{j=1}^{k} |α_ℓ(j)| |G_{ij}| ≤ M(A) Σ_{j=1}^{k} |α_ℓ(j)| ≤ k M(A) |α_ℓ(i*)|.   (4.6)
Combining (4.5) and (4.6), we get that for (4.4) to hold, we need
1 − (k − 1) M(A) > k M(A).   (4.7)
Since k was selected to satisfy this bound, relation (4.4) follows.
Thus, when k satises (4.3), the Homotopy algorithm only ever considers indices among
the rst k as candidates to enter the active set.
4.2 Sign Agreement
Recall that an index is removed from the active set at step ℓ only if condition (2.4) is violated, i.e. if the signs of the residual correlations on the active set no longer agree with the signs of the corresponding solution coefficients. The next lemma shows that this cannot happen under the assumptions of Theorem 1.
Lemma 4 Let r_ℓ be the residual at the ℓ-th step, c_ℓ = A^T r_ℓ, and let the active set I be defined as in (2.6). Then the update direction d_ℓ computed by (2.7) satisfies
sgn(d_ℓ(I)) = sgn(c_ℓ(I)).   (4.8)
Proof. Let λ_ℓ = ‖c_ℓ(I)‖_∞, and recall that for i ∈ I we have c_ℓ(i) = λ_ℓ sgn(c_ℓ(i)). Define d̃_ℓ(I) = λ_ℓ d_ℓ(I); by (2.7), A_I^T A_I d̃_ℓ(I) = λ_ℓ sgn(c_ℓ(I)) = c_ℓ(I), which we may rewrite as
(A_I^T A_I − I_d) d̃_ℓ(I) = −d̃_ℓ(I) + c_ℓ(I),
where I_d is the identity matrix of appropriate dimension. This yields
‖d̃_ℓ(I) − c_ℓ(I)‖_∞ ≤ ‖A_I^T A_I − I_d‖_{(∞,∞)} ‖d̃_ℓ(I)‖_∞
  ≤ ((1 − M(A))/2) ‖d̃_ℓ(I)‖_∞
  ≤ ((1 − M(A))/2) ( ‖c_ℓ(I)‖_∞ + ‖d̃_ℓ(I) − c_ℓ(I)‖_∞ )
  = ((1 − M(A))/2) ( λ_ℓ + ‖d̃_ℓ(I) − c_ℓ(I)‖_∞ ),
where ‖·‖_{(∞,∞)} denotes the operator norm induced by the ℓ∞ vector norm, and the second inequality uses the facts that the off-diagonal entries of A_I^T A_I are bounded by M(A) and that |I| ≤ k ≤ (M(A)^{-1} + 1)/2. Rearranging terms, we obtain
‖d̃_ℓ(I) − c_ℓ(I)‖_∞ ≤ λ_ℓ (1 − M(A))/(1 + M(A)) < λ_ℓ,
thus
‖d_ℓ(I) − sgn(c_ℓ(I))‖_∞ < 1.
Relation (4.8) follows.
4.3 Proof of Theorem 1
Lemmas 3 and 4 establish the correct term selection property and the sign agreement property at a single iteration of the algorithm. We will now give an inductive argument showing that the two properties hold at every step ℓ ∈ {1, ..., k}.
In detail, the algorithm starts with x_0 = 0; by the sparsity assumption on y = Aα_0, we may apply Lemma 3 with r_0 = y, to get that at the first step, a column among the first k is selected. Moreover, at the end of the first step we have x_1 = γ_1 d_1, and so, by Lemma 4,
sgn(x_1(I)) = sgn(c_1(I)).
By induction, assume that at step ℓ, only indices among the first k are in the active set, i.e.
r_ℓ = Σ_{j=1}^{k} α_ℓ(j) a_j,
and the sign condition (2.4) holds, i.e.
sgn(x_ℓ(I)) = sgn(c_ℓ(I)).
Applying Lemma 3 to r_ℓ, we get that at the (ℓ+1)-th step, the term to enter the active set will be selected from among the first k indices. Moreover, for the updated solution we have
sgn(x_{ℓ+1}(I)) = sgn(x_ℓ(I) + γ_{ℓ+1} d_{ℓ+1}(I)).
By Lemma 4 we have
sgn(d_{ℓ+1}(I)) = sgn(c_{ℓ+1}(I)).
We observe that
c_{ℓ+1} = c_ℓ − γ_{ℓ+1} A^T A d_{ℓ+1},
whence, on the active set,
|c_{ℓ+1}(I)| = |c_ℓ(I)| (1 − γ_{ℓ+1}/λ_ℓ),
and so
sgn(c_{ℓ+1}(I)) = sgn(c_ℓ(I)).
In words, once a vector enters the active set, its residual correlation maintains the same sign. We conclude that
sgn(x_{ℓ+1}(I)) = sgn(c_{ℓ+1}(I)).   (4.9)
Hence no variables leave the active set. To summarize, both the correct term selection property and the sign agreement property are maintained throughout the execution of the Homotopy algorithm.
We may now invoke Lemma 2 to conclude that the k-step solution property holds. This completes the proof of Theorem 1.
5 Fast Solution with the Uniform Spherical Ensemble
Suppose now that A is a random matrix drawn from the Uniform Spherical Ensemble. It is not hard to show that such matrices A are naturally incoherent; in fact, for ε > 0, with high probability for large n,
M(A) ≤ sqrt( 4 log(n) / d ) · (1 + ε).
Thus, applying Theorem 1, we get as an immediate corollary that if k satisfies
k ≤ sqrt( d / log(n) ) · (1/4 − ε),   (5.10)
then Homotopy is highly likely to recover any sparse vector α_0 with at most k nonzeros.
However, as it turns out, Homotopy operates much more efficiently when applied to instances from the random suite S(USE; d, n, k) than what is implied by incoherence. We now discuss evidence leading to Empirical Finding 1. We demonstrate, through a comprehensive suite of simulations, that the formula d/(2 log(n)) accurately describes the breakdown of the k-step solution property. Before doing so, we introduce two important tools that will be used repeatedly in the empirical analysis below.
5.1 Phase Diagrams and Phase Transitions
In statistical physics, a phase transition is the transformation of a thermodynamic system from one phase to another. The distinguishing characteristic of a phase transition is an abrupt change in one or more of its physical properties, as underlying parameters cross a region of parameter space. Borrowing terminology, we say that property T exhibits a phase transition at a sequence of problem suites S(E, V; d, n, k), if there is a threshold function Thresh(d, n) on the parameter space (k, d, n), such that problem instances with k ≤ Thresh(d, n) have property T with high probability, and problem instances above this threshold do not have property T with high probability. As we show below, the k-step solution property of Homotopy, applied to problem instances drawn from S(USE; d, n, k), exhibits a phase transition at around d/(2 log(n)). Thus, there is a sequence (ε_n) with each term small and positive, so that for k < (1 − ε_n) · d/(2 log(n)), Homotopy delivers the sparsest solution in k steps with high probability; on the other hand, if k > (1 + ε_n) · d/(2 log(n)), then with high probability, Homotopy fails to terminate in k steps.
To visualize phase transitions, we utilize phase diagrams. In statistical physics, a phase diagram is a graphical depiction of a parameter space decorated by boundaries of regions where certain properties hold. Here we use this term in an analogous fashion, to mean a graphical depiction of a subset of the parameter space (d, n, k), illustrating the region where an algorithm has a given property T, when applied to a problem suite S(E, V; d, n, k). It will typically take the form of a 2-D grid, with the threshold function associated with the corresponding phase transition dividing the grid into two regions: a region where property T occurs with high probability, and a region where T occurs with low probability.
While phase transitions are sometimes discovered by formal mathematical analysis, more commonly, the existence of phase transitions is unearthed by computer simulations. We now describe an objective framework for estimating phase transitions from simulation data. For each problem instance (A, y) drawn from a suite S(E, V; d, n, k), we associate a binary outcome variable Y_k^{d,n}, with Y_k^{d,n} = 1 when property T is satisfied on that realization, and Y_k^{d,n} = 0 otherwise. Within our framework, Y_k^{d,n} is modeled as a Bernoulli random variable with probability π_k^{d,n}. Thus, P(Y_k^{d,n} = 1) = π_k^{d,n}, and P(Y_k^{d,n} = 0) = 1 − π_k^{d,n}. Our goal is to estimate a value k̂_{d,n} where the transition between these two states occurs. To do so, we employ a logistic regression model on the mean response E[Y_k^{d,n}] with predictor variable k,
E[Y_k^{d,n}] = exp(β_0 + β_1 k) / (1 + exp(β_0 + β_1 k)).   (5.11)
Logistic response functions are often used to model threshold phenomena in statistical data analysis [34]. Let π̂_k^{d,n} denote the value of the fitted response function. We may then compute a value k̂_{d,n} indicating that, with probability exceeding 1 − η, for a problem instance drawn from S(E, V; d, n, k) with k < k̂_{d,n}, property T holds. k̂_{d,n} is the value of k at which the fitted response satisfies π̂_k^{d,n} = 1 − η. Thus, computing k̂_{d,n} for a range of (d, n) values essentially maps out an empirical phase transition of property T. The value η fixes a location on the transition band below which we define the region of success. In our work, we set η = 0.25.
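A minimal sketch of this estimation step, assuming the per-trial outcomes have already been collected (the arrays below are synthetic placeholders, and the symbol η follows the text): we fit (5.11) by maximum likelihood with statsmodels and read off the value of k at which the fitted response crosses 1 − η.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
k_vals = np.repeat(np.arange(1, 41), 100)            # 100 trials per value of k
true_threshold = 18.0
p_success = 1.0 / (1.0 + np.exp(0.9 * (k_vals - true_threshold)))
outcomes = (rng.random(k_vals.size) < p_success).astype(float)

# maximum-likelihood fit of the logistic response (5.11)
design = sm.add_constant(k_vals.astype(float))        # columns: [1, k]
fit = sm.Logit(outcomes, design).fit(disp=False)
beta0, beta1 = fit.params

# k_hat solves beta0 + beta1*k = logit(1 - eta), i.e. fitted success prob = 1 - eta
eta = 0.25
k_hat = (np.log((1 - eta) / eta) - beta0) / beta1
print(k_hat)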
5.2 The k-Step Solution Property
We now apply the method described in the previous section to estimate an empirical phase transition for the k-step solution property of the Homotopy algorithm. Before doing so, we briefly describe the experimental setup.
Figure 4: k-Step Solution Property of Homotopy for the Problem Suite S(USE, Uniform; d, n, k). Shaded attribute is the proportion of successful terminations, out of 100 trials, in the case: (a) k = 1 . . . 40, d = 200, and n = 200 . . . 20000; (b) k = 2 . . . 100, d = 100 . . . 1000, and n = 1000. Overlaid are curves for d/(2 log(n)) (solid), k̂^{d,n}_{0.25} (dashed), and the confidence bound for k̂^{d,n}_{0.25} with 95% confidence level (dotted).
For (d, n, k) given, we generated a problem instance (A, y) drawn from the problem suite S(USE, Uniform; d, n, k). To this instance we applied the Homotopy algorithm, and recorded the outcome of the response variable Y_k^{d,n}. We did so in two cases: first, fixing d = 200, varying n through 40 log-equispaced points in [200, 20000] and k through 40 equispaced points in [1, 40]; second, keeping n fixed at 1000, varying d through 40 equispaced points in [100, 1000] and k through 40 equispaced points in [2, 100]. For each (d, n, k) instance we ran 100 independent trials. To estimate the regression coefficients β_0, β_1 in (5.11), we used maximum likelihood estimation [34].
Figure 4 displays two phase diagrams, giving a succinct depiction of the simulation results. Panel (a) displays a grid of (k, n) values, k ∈ [1, 40], n ∈ [200, 20000]. Each point on this grid takes a value between 0 and 1, representing the ratio of successful outcomes to overall trials; these are precisely the observed E[Y_k^{d,n}], on a grid of (k, n) values, with d fixed. Overlaid are curves for the estimated k̂_{d,n} for each n with η = 0.25, and the theoretical bound d/(2 log(n)). Panel (b) has a similar configuration on a (k, d) grid, with n fixed at 1000. Careful inspection of these phase diagrams reveals that the estimated k̂_{d,n} closely follows the curve d/(2 log(n)), suggesting that the term ε_n in (1.2) becomes smaller as n increases. We note that the regression coefficient β_1 in (5.11) associated with the predictor variable k dictates how sharp the transition from π_k^{d,n} = 1 to π_k^{d,n} = 0 is; a large negative value for β_1 implies a step-function behavior. Hence, we expect β_1 to grow in magnitude as the problem size increases. This is indeed verified in panel (a) of Figure 5, which displays the behavior of β_1 for increasing n, and d fixed at 200. For illustration, panels (b) and (c) show the observed mean response Ȳ_k^{d,n} vs. k for n = 200, with β_1 ≈ −0.35, and n = 20000, with β_1 ≈ −1.1, respectively. Clearly, the phase transition in panel (c) is much sharper.
5.3 Correct Term Selection and Sign Agreement
In Section 4, we have identified two fundamental properties of the Homotopy solution path, namely, correct term selection and sign agreement, that constitute necessary and sufficient conditions for the k-step solution property to hold. We now present the results of a simulation study identifying phase transitions in these cases.
In detail, we repeated the experiment described in the previous section, generating problem instances from the suite S(USE, Uniform; d, n, k) and applying the Homotopy algorithm.
Figure 6: Phase diagrams for (a) the correct term selection property; (b) the sign agreement property, on a k-n grid, with k = 1 . . . 40, d = 200, and n = 200 . . . 20000. Overlaid are curves for k̂^{d,n}_{0.25} (dashed), and the confidence bound for k̂^{d,n}_{0.25} with 95% confidence level (dotted). Panel (a) also displays the curve d/(2 log(n)) (solid), and panel (b) displays the curve d/(√2 log(n)) (solid). The transition on the right-hand display occurs significantly above the transition in the left-hand display. Hence the correct term selection property is the critical one for overall success.
As the outcome of the response variable Y_k^{d,n}, we recorded first the property "the Homotopy algorithm selected only terms from among {1, ..., k} as entries to the active set", and next, the property "at each iteration of the Homotopy algorithm, the signs of the residual correlations on the active set match the signs of the corresponding solution terms". Finally, we estimated phase transitions for each of these properties. Results are depicted in panels (a) and (b) of Figure 6. Panel (a) displays a phase diagram for the correct term selection property, and panel (b) has a similar phase diagram for the sign agreement property.
The results are quite instructive. The phase transition for the correct term selection property agrees well with the formula d/(2 log(n)). Rather surprisingly, the phase transition for the sign agreement property seems to be at a significantly higher threshold, d/(√2 log(n)). Consequently, the weak link for the Homotopy k-step solution property, i.e. the attribute that dictates the sparsity bound for the k-step solution property to hold, is the addition of incorrect terms into the active set. We interpret this finding as saying that the Homotopy algorithm would typically err first on the safe side, adding terms off the support of α_0 into the active set (false discoveries), before removing correct terms from the active set (missed detections).
5.4 Random Signs Ensemble
Matrices in the Random Signs Ensemble (RSE) are constrained to have elements of constant amplitude, with signs independently drawn from an equi-probable Bernoulli distribution. This ensemble is known to be effective in various geometric problems associated with underdetermined systems, through work of Kashin [36], followed by Garnaev and Gluskin [28]. Previous work [55, 56, 54] gave theoretical and empirical evidence to the fact that many results developed in the USE case hold when the matrices considered are drawn from the RSE. Indeed, as we now demonstrate, the RSE shares the k-step solution property with the USE.
To show this, we conducted a series of simulations, paralleling the study described in Section 5.2. In our study, we replaced the problem suite S(USE; d, n, k) by S(RSE; d, n, k). Just as above, we estimated k̂_{d,n} in two cases.
Figure 7: k-Step Termination of Homotopy for the problem suite S(RSE, Uniform; d, n, k). Shaded attribute is the proportion of successful terminations, out of 100 trials, in the case: (a) k = 1 . . . 40, d = 200, and n = 200 . . . 20000; (b) k = 2 . . . 100, d = 100 . . . 1000, and n = 1000. Overlaid are curves for d/(2 log(n)) (solid), k̂^{d,n}_{0.25} (dashed), and the confidence bound for k̂^{d,n}_{0.25} with 95% confidence level (dotted).
First, fixing d = 200, we varied n through 40 log-equispaced points in [200, 20000] and k through 40 equispaced points in [1, 40]; second, keeping n fixed at 1000, we varied d through 40 equispaced points in [100, 1000] and k through 40 equispaced points in [2, 100]. The resulting phase diagrams appear in panels (a),(b) of Figure 7. Careful inspection of the results indicates that the empirical phase transition for the RSE agrees well with the formula d/(2 log(n)) that describes the behavior in the USE case. This is further verified by examining Table 4, which has MSE and R² measures for the deviation of the empirical phase transition from the threshold d/(2 log(n)), for different coefficient ensembles. In summary, these empirical results give strong evidence to support the conclusion that the Homotopy algorithm has the k-step solution property when applied to the problem suite S(RSE; d, n, k) as well. An important implication of this conclusion is that, in practice, we may use matrices drawn from the RSE to replace matrices from the USE without loss of performance. This may result in reduced complexity and increased efficiency, as matrices composed of random signs can be generated and applied much more rapidly than general matrices composed of real numbers.
Coefficient Ensemble    MSE     R²
Uniform                 0.12    0.98
Gauss                   0.11    0.98
Bernoulli               0.08    0.99
Table 4: Deviation of the estimated k̂_{d,n} from the threshold d/(2 log(n)), for different coefficient ensembles.
Figure 10: Estimated Phase Transition Curves. Each panel displays empirical phase transition
curves for the PFE (solid), PHE (dashed), URPE (dash-dotted), with (a) Uniform nonzero
distribution; (b) Gaussian nonzero distribution; (c) Bernoulli nonzero distribution.
First, notice that at the right end of the phase diagram, we have ρ(δ = 1) = 1. In other words, when the matrix A is square, Homotopy always recovers the sparsest solution in k steps. This is hardly surprising, since, at δ = 1, the matrix A is orthogonal, and the data y is orthogonal to columns of A off the support of α_0, ensuring correct term selection. In addition, the update step d_ℓ then satisfies d_ℓ(I) = sgn(c_ℓ(I)), so the sign agreement property holds trivially.
A parallel result holds for Lars: given a problem instance drawn from S(Inc_μ; d, n, k), with k ≤ (μ^{-1} + 1)/2, Lars runs k steps and stops, delivering the solution α_0.
Similar conclusions can be made about the assertions of Empirical Findings 1 and 2.
7.3 LARS and OMP
We already hinted at the fact that Homotopy and Omp share an inherent anity. We now
demonstrate that the two methods are based on the same fundamental procedure; solving a
29
sequence of least-squares problems on an increasingly large subspace, dened by the active
columns of A.
More precisely, assume we are at the ℓ-th iteration, with a solution estimate x_{ℓ-1} and an active set I consisting of ℓ − 1 terms. Recall the procedure underlying Omp: it appends a new term to the active set by selecting the vector a_j that maximizes the absolute residual correlation, j = arg max_{j ∉ I} |a_j^T (y − A x_{ℓ-1})|, and sets I := I ∪ {j}. It then computes the new solution estimate by solving the least-squares problem

A_I^T A_I x_ℓ(I) = A_I^T y,   (7.12)

thus guaranteeing that the residual is orthogonal to the subspace R(A_I) spanned by the active columns. Once y is spanned by the active columns, the algorithm terminates with zero residual.
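The following sketch (Python/NumPy; an illustration added here, not the implementation used in our experiments) spells out this loop: greedy selection by maximal residual correlation followed by the least-squares solve (7.12).

```python
import numpy as np

def omp(A, y, max_steps, tol=1e-10):
    """Orthogonal Matching Pursuit: greedy selection + least squares on the active set."""
    d, n = A.shape
    x = np.zeros(n)
    active = []                                # active set I
    r = y.copy()                               # residual y - A x
    for _ in range(max_steps):
        if np.linalg.norm(r) <= tol:           # y is spanned by the active columns
            break
        corr = A.T @ r                         # residual correlations a_j^T r
        j = int(np.argmax(np.abs(corr)))       # maximal absolute residual correlation
        if j not in active:                    # active columns have ~zero correlation anyway
            active.append(j)
        # least-squares problem (7.12): A_I^T A_I x(I) = A_I^T y
        x_I, *_ = np.linalg.lstsq(A[:, active], y, rcond=None)
        x = np.zeros(n)
        x[active] = x_I
        r = y - A @ x                          # residual is now orthogonal to span(A_I)
    return x, active
```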
While Lars was derived from a seemingly disparate perspective, it in fact follows the same basic procedure, replacing (7.12) with a penalized least-squares problem. Recalling the discussion in Section 2, we note that Lars also selects a new term according to the maximum correlation criterion, and then solves

A_I^T A_I x_ℓ(I) = A_I^T y − λ s,   (7.13)

where s is a vector of length ℓ, recording the signs of the residual correlations of each term at the point it entered the active set, and λ is the current value of the regularization parameter. ...

... S(Inc_μ; d, n, k); the statement of Theorem 1 holds for Omp as well (see, e.g., Corollary 3.6 in [52]). Second, for the random problem suite S(USE; d, n, k), Tropp et al. proved results which can now be reinterpreted as saying that, when k ≤ d/(c log(n)), then with high probability for large n, Omp has the k-step solution property, paralleling the statement of Empirical Finding 1. We believe, in fact, that the bound k ≤ d/(2 log(n)) holds for Omp as well.
To summarize, we exhibited a series of transformations, starting with ℓ1 minimization, and ending with greedy pursuit. Each transformation is characterized clearly, and maintains the k-step property of the solution path. We believe this sequence clarifies the initially surprising similarities between results about ℓ1 minimization and Omp.
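To highlight how small the gap between the two procedures is, the sketch below (again Python/NumPy, our own illustration) contrasts the OMP active-set solve (7.12) with the penalized solve (7.13); here lam and s stand for the current regularization parameter and the recorded sign vector, which we assume are supplied by the surrounding path-following loop.

```python
import numpy as np

def omp_active_solve(A, y, active):
    """OMP update (7.12): ordinary least squares on the active columns."""
    G = A[:, active].T @ A[:, active]
    return np.linalg.solve(G, A[:, active].T @ y)

def lars_active_solve(A, y, active, lam, s):
    """LARS/Homotopy update (7.13): the same normal equations, shrunk by lam
    in the direction of the sign vector s recorded at entry to the active set."""
    G = A[:, active].T @ A[:, active]
    return np.linalg.solve(G, A[:, active].T @ y - lam * s)

# As lam -> 0 the two updates coincide, which is one way to see why the two
# solution paths agree when the methods keep selecting the same terms.
```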
8 Homotopy and PFP

Polytope Faces Pursuit was introduced as a greedy algorithm to solve (P1) in the dual space. As pointed out in [43, 42], although the algorithm is derived from a seemingly different viewpoint, it exhibits interesting similarities to Homotopy. To study these in more detail, we first briefly describe the algorithm.

Define the standard-form linear program equivalent to (P1)

(P̃1)  min 1^T x̃ subject to y = Ã x̃, x̃ ≥ 0,

where Ã = [A −A]. The equivalence of (P1) and (P̃1) is well established; the solutions x̃ of (P̃1) and x of (P1) have the correspondence x̃(j) = max{x(j), 0} for 1 ≤ j ≤ n, and x̃(j) = max{−x(j − n), 0} for n + 1 ≤ j ≤ 2n. The Lagrangian dual of the linear program (P̃1) is given by

(D̃1)  max y^T v subject to Ã^T v ≤ 1.
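A minimal sketch of this standard-form transformation, assuming a generic LP solver (here SciPy's linprog, which is not the solver used elsewhere in this paper):

```python
import numpy as np
from scipy.optimize import linprog

def l1_min_via_lp(A, y):
    """Solve (P1) min ||x||_1 s.t. y = A x through the standard-form LP (P~1)."""
    d, n = A.shape
    A_tilde = np.hstack([A, -A])                # A~ = [A  -A]
    c = np.ones(2 * n)                          # objective 1^T x~
    res = linprog(c, A_eq=A_tilde, b_eq=y, bounds=(0, None), method="highs")
    x_tilde = res.x
    # map back: x(j) = x~(j) - x~(j+n); at the optimum one element of each pair is zero
    return x_tilde[:n] - x_tilde[n:]
```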
By the strong duality theorem, an optimal solution x̃ of the primal problem (P̃1) is associated with an optimal solution v of the dual problem (D̃1), such that 1^T x̃ = y^T v. Pfp operates in the dual space, tracing a solution to the dual problem (D̃1), which then yields a solution to the primal problem (P̃1) (or, equivalently, (P1)). To do so, the algorithm maintains an active set I of nonzero coordinates, and carries out a familiar procedure at each iteration: selection of a new term to enter the active set, removal of violating terms from the active set (if any), and computation of a new solution estimate. Specifically, at the ℓ-th stage, Pfp first selects a new term to enter the active set, via the criterion

i_ℓ = arg max_{i ∉ I} { a_i^T r_ℓ / (1 − a_i^T v_ℓ) : a_i^T r_ℓ > 0 },   (8.14)
with r_ℓ = y − Ã x̃_{ℓ-1} the current residual and v_ℓ the current dual vector. In comparison, the analogous Homotopy term-selection criterion may be written

i^+ = arg min^+_{i ∉ I} { (c_ℓ − a_i^T r_ℓ) / (1 − a_i^T v_ℓ) },   (8.17)

where c_ℓ denotes the common magnitude of the residual correlations on the active set, v_ℓ = Ã_I d̃_ℓ(I) = Ã_I (Ã_I^T Ã_I)^{-1} 1 as in (8.16) above, and min^+ indicates that the minimum is taken over positive arguments. Comparing (8.17) with (8.14), we note that the main difference lies in the residual; since Pfp projects the data y onto the space spanned by the active vectors, denoted R_I, its residual is necessarily orthogonal to R_I. Homotopy, in contrast, does not enforce orthogonality, and its residual can be expressed as the sum of a component in R_I and a component in the orthogonal complement R_I^⊥. Formally,

r_ℓ = P_I r_ℓ + r̃_ℓ,

where P_I = Ã_I (Ã_I^T Ã_I)^{-1} Ã_I^T is the orthogonal projector onto R_I, and r̃_ℓ = (Id − P_I) r_ℓ is the component of r_ℓ in R_I^⊥. The component of r_ℓ in R_I obeys

P_I r_ℓ = Ã_I (Ã_I^T Ã_I)^{-1} Ã_I^T r_ℓ = c_ℓ v_ℓ,   (8.18)

where we used the fact that, along the Homotopy path, the residual correlations on the active set all equal c_ℓ.
Plugging this into (8.17), we get

i^+ = arg min^+_{i ∉ I} { (c_ℓ − a_i^T (r̃_ℓ + c_ℓ v_ℓ)) / (1 − a_i^T v_ℓ) }
    = arg min^+_{i ∉ I} { c_ℓ − a_i^T r̃_ℓ / (1 − a_i^T v_ℓ) }
    = arg max_{i ∉ I} { a_i^T r̃_ℓ / (1 − a_i^T v_ℓ) : a_i^T r̃_ℓ > 0 },
Figure 11: Phase Diagrams of Pfp for the Problem Suite S(USE, Uniform; d, n, k). Shaded attribute is the proportion of successful executions among total trials, for: (a) k-step solution property; (b) correct term selection property; (c) sign agreement property. Overlaid on each plot are curves for d/(2 log(n)) (solid), the estimated transition $\hat{k}^{0.25}_{d,n}$ (dashed), and the confidence bound for $\hat{k}^{0.25}_{d,n}$ with 95% confidence level (dotted).
where the last equality holds because Ã^T v_ℓ ≤ 1, so that 1 − a_i^T v_ℓ ≥ 0. Since r̃_ℓ = y − P_I y, as long as no elements are removed from the active set, r_ℓ^{PFP} = r̃_ℓ ...

... S(Inc_μ; d, n, k), with k ≤ (μ^{-1} + 1)/2. Then, Pfp runs k steps and stops, delivering the solution x_0.
Similarly, Empirical Findings 1 and 2 hold for Pfp as well. Figure 11 displays phase diagrams for the k-step solution property (Panel (a)), correct term selection property (Panel (b)), and sign agreement property (Panel (c)), when Pfp is applied to instances of the problem suite S(USE, Uniform; d, n, k). The three properties exhibit a clear phase transition at around d/(2 log(n)). Analogously, the statements in Empirical Finding 2 can be verified to hold for Pfp as well; we do not present such evidence here, for reasons of space.
9 Recovery of Sparse Solutions
Previous sections examined various properties of Homotopy, all centered around the k-step solution property. The motivation is clear: the Homotopy algorithm is at peak efficiency when it satisfies the k-step solution property. Yet, we noted that failure to terminate in k steps need not imply failure to recover the sparsest solution. We now turn to study this latter property in more detail.
Definition 5 An algorithm 𝒜 has the k-sparse Solution Property at a given problem instance (A, y) drawn from S(E, V; d, n, k) if, when applied to the data (A, y), the algorithm terminates with the solution x_0.
Earlier papers used a different term for the same concept [18, 15, 55]; they would use the term ℓ0-equivalence to indicate recovery of the sparsest solution, i.e. equivalence to the solution of the combinatorial optimization problem

(P0)  min_x ‖x‖_0 subject to y = Ax.
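In a simulation study such as ours, the k-sparse solution property can be checked directly against the generating vector x_0; a minimal sketch (our own illustration; the instance constructor and the tolerance are assumptions, not part of the paper's protocol) follows.

```python
import numpy as np

rng = np.random.default_rng(1)

def use_uniform_instance(d, n, k):
    """One instance from S(USE, Uniform; d, n, k): columns of A uniform on the
    sphere, x0 with k nonzeros drawn from U[0, 1] at random positions, y = A x0."""
    A = rng.standard_normal((d, n))
    A /= np.linalg.norm(A, axis=0)
    x0 = np.zeros(n)
    x0[rng.choice(n, size=k, replace=False)] = rng.uniform(0.0, 1.0, size=k)
    return A, A @ x0, x0

def recovers_sparsest(x_hat, x0, tol=1e-4):
    """k-sparse solution property: did the algorithm terminate at x0 (up to tol)?"""
    return np.linalg.norm(x_hat - x0) <= tol * np.linalg.norm(x0)
```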
For the Homotopy algorithm, the k-sparse solution property has been well studied. Indeed, since the Homotopy algorithm is guaranteed to solve (P1), it will have the k-sparse solution property whenever the solutions to (P0) and (P1) coincide. In a series of papers [15, 13, 17], Donoho has shown that for problem instances from the suite S(USE; d, n, k), the equivalence between solutions of (P0) and (P1) exhibits a clear phase transition, denoted ρ_W, that can be computed numerically. This phase transition provides a theoretical prediction for the conditions under which Homotopy satisfies the k-sparse solution property. Yet, at present, no such predictions exist for Lars, Omp, or Pfp.
We now present the results of an extensive simulation study, examining the validity of the k-sparse solution property for Homotopy, Lars, Omp, and Pfp, comparing the performance of these four closely related algorithms. The study we conduct closely parallels the experiments described above. In particular, we measure the k-sparse solution property for the four algorithms, applied to problem instances from the suites S(USE, Uniform; d, n, k), S(USE, Gauss; d, n, k) and S(USE, Bernoulli; d, n, k). We generated 100 independent trials at each point on the parameter grid, indexed by δ = d/n and ρ = k/d, with δ ranging through 40 equispaced points in [0.1, 1], ρ ranging through 40 equispaced points in [0.05, 1], and n fixed at 1000. The resulting phase diagrams appear in Figure 12. Parallel results summarizing a similar experiment done for the URPE are presented in Figure 13. Preliminary inspection of these phase diagrams reveals that the suggestive similarities between the algorithms indeed translate into comparable performance in recovering the sparsest solution. More careful examination prompts several important observations.
Similarity of Lars and Homotopy. Throughout all the displayed phase diagrams, the empirical phase transition of Lars closely follows the transition of Homotopy. In other words, almost anytime ℓ1 minimization successfully recovers the sparsest solution, Lars succeeds in finding it as well. This is a rather surprising finding, as, at least in the USE case, for k > d/(2 log(n)), Lars is bound to err and add elements not in the support of x_0 into its active set. The evidence we present here suggests that inside the ℓ1-ℓ0 equivalence phase, Lars still correctly recovers the support of x_0 and obtains the sparsest solution.
Universality of Lars and Homotopy. Examining Figure 12, we note that for the suite S(USE; d, n, k), the performance of Lars and Homotopy is independent of the underlying distribution of the nonzero coefficients. Universality does not hold for Omp, whose performance is affected by the distribution of the nonzeros. In particular, Omp's worst performance is recorded when the underlying nonzeros have constant amplitude and a random sign pattern; this has been noted previously in the analysis of Omp [15, 52].
Increased Performance with S(URPE; d, n, k). Comparison of Figure 13 with Figure 12 reveals that, in general, all four algorithms perform better when applied to problem instances from the suite S(URPE; d, n, k). Universality, however, no longer holds for this problem suite. In particular, Homotopy and Lars show a marked increase in performance when applied to the suite S(URPE, Uniform; d, n, k). In our experiments, we observed similar phenomena when applying the algorithms to problems drawn from the suites S(PFE; d, n, k) and S(PHE; d, n, k) as well. This is indeed an important finding, as these suites are used extensively in applications.
10 Stylized Applications
Earlier, we gave theoretical and empirical evidence showing that in certain problem scenarios, when sufficient sparsity is present, the Homotopy path and the Lars path agree. We now present examples, inspired by concrete applications in NMR spectroscopy and MR imaging, that demonstrate the performance of the two algorithms in solving sparse approximation problems. In accord with earlier sections, we shall compare and contrast the results with those obtained by Omp.
10.1 Spectroscopy
Nuclear Magnetic Resonance (NMR) spectroscopy is a widely used method of structure determination in modern chemistry; see [35]. In essence, NMR spectroscopy is based on the behavior of atomic nuclei in a magnetic field. An NMR spectrometer records the resonance frequencies of nuclei when exposed to a strong magnetic field and irradiated with an RF pulse. The resulting signal is a superposition of all excited frequencies. For a caricature of 1-D NMR spectra, consider the signal Bumps, depicted in Figure 14(a). It is a synthetic signal, made up of a few peaks at varying amplitudes. Figure 14(b) shows its wavelet expansion on a scale-by-scale basis. Clearly, Bumps has relatively few significant wavelet coefficients, and it admits a sparse representation in a wavelet basis.
As a first example, we consider reconstruction of NMR spectra from partial frequency information [46, 47]. In detail, let x be an NMR signal of length n, which admits sparse synthesis in a wavelet basis, i.e. α = Wx is sparse, with W a wavelet analysis operator. Let Φ be a d × n matrix drawn from the partial Fourier ensemble, i.e. constructed by selecting d rows of an n × n Fourier matrix at random. At the sampling stage, we compute y = Φx; in words, y is computed by sampling d frequencies of x at random. At the reconstruction stage, we solve

(CS1)  min ‖α‖_1 subject to y = Φ W^T α,

where we used the fact that W is an orthogonal operator, whence x = W^T α. Readers familiar with the notion of Compressed Sensing (CS) [14, 56] will recognize (CS1) as a particular case of a Compressed Sensing problem. Indeed, work by Donoho [14] has shown that when the signal x is compressible (i.e. has a sparse representation in a certain basis), the solution of (CS1) is faithful to the original signal x.
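For readers who want to experiment, here is a self-contained sketch of the same pipeline (our own illustration, not the SparseLab script behind Figure 15): to keep the linear program real-valued we let a random row selection of a DCT stand in for the partial Fourier matrix, drop the wavelet operator, and solve the ℓ1 problem with a generic LP solver rather than Homotopy.

```python
import numpy as np
from scipy.fft import dct
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, d, k = 256, 128, 10
F = dct(np.eye(n), norm="ortho", axis=0)        # n x n orthogonal transform (DCT)
Phi = F[rng.choice(n, size=d, replace=False)]   # keep d rows at random
alpha0 = np.zeros(n)
alpha0[rng.choice(n, size=k, replace=False)] = rng.uniform(1, 2, size=k)
y = Phi @ alpha0                                # "sampling stage"

# "reconstruction stage": min ||alpha||_1 s.t. y = Phi alpha, as a standard-form LP
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([Phi, -Phi]), b_eq=y, bounds=(0, None), method="highs")
alpha_hat = res.x[:n] - res.x[n:]
print("relative error:", np.linalg.norm(alpha_hat - alpha0) / np.linalg.norm(alpha0))
```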
For the purposes of this simulation, the signal Bumps was digitized to n = 1024 samples. We collected d = 512 random frequency measurements, and applied the Homotopy algorithm to solve (CS1), as well as Lars, Omp, and Pfp. Figure 15 summarizes the results at several stages of execution, and Table 5 has corresponding error and timing measures. The results indicate that all four algorithms faithfully recovered the signal, at roughly the same error. Of the four algorithms, Omp was fastest, though only marginally, compared to Homotopy and Lars. We also note that, owing to the rapid coefficient decay of the wavelet expansion of Bumps, already after 200 iterations we obtain accurate reconstructions, at a fraction of the time it takes the execution to complete. This suggests that, in practice, if we have knowledge of the sparsity structure of the solution, we may enforce early termination without compromising the quality of reconstruction. As the example in the next section illustrates, the savings are particularly significant in large-scale applications.
Figure 12: The ℓ0 Equivalence Property for the Suite S(USE, V; d, n, k). Each panel has plots of the empirical phase transition curves of Homotopy (dashed), Lars (dash-dotted), Omp (dotted) and Pfp (solid) alongside the theoretical curve ρ_W, on a δ-ρ grid, with n = 1000, for: (a) V = Uniform; (b) V = Gauss; (c) V = Bernoulli.
Figure 13: The ℓ0 Equivalence Property for the Suite S(URPE, V; d, n, k). Each panel has plots of the empirical phase transition curves of Homotopy (dashed), Lars (dash-dotted), Omp (dotted) and Pfp (solid) alongside the theoretical curve ρ_W, on a δ-ρ grid, with n = 1000, for: (a) V = Uniform; (b) V = Gauss; (c) V = Bernoulli.
Figure 14: (a) Signal Bumps, with n = 1024 samples; (b) Its wavelet expansion.
Figure 15: CS reconstruction of Bumps: Left to right: Solutions of Homotopy, Lars, Omp,
and Pfp; Top to bottom: Solutions after 100 iterations, 200 iterations, and at termination.
10.2 MR Imaging
Magnetic Resonance Imaging (MRI) is a technique based on the same principles as NMR spectroscopy. It is used primarily in medical settings to produce high-quality images of internal organs in the human body. Unlike NMR, here we are interested in the spatial location of protons exposed to a magnetic field. Acquisition of an MR image involves adapting the gradients of the external magnetic field to get a spatial frequency map of the exposed organ. As MRI
Algorithm     After 100 iters.       After 200 iters.       Final result
              relerr_2   Time        relerr_2   Time        relerr_2   Time
Homotopy      0.280      0.6 secs    0.105      1.5 secs    0.052      12.4 secs
Lars          0.280      0.5 secs    0.105      1.5 secs    0.049      10.7 secs
Omp           0.120      0.4 secs    0.021      1.3 secs    0.017      7.0 secs
Pfp           0.154      1.5 secs    0.053      3.7 secs    0.050      19.5 secs

Table 5: Error measures and execution times for CS reconstruction of Bumps with Homotopy, Lars, Omp, and Pfp.
often deals with enormous volumes of data, acquisition time is a key factor in the success of this technique. Thus, the ability to reconstruct a faithful image from partial frequency information is of great benefit.

The scenario we consider in this section parallels the spectroscopy example, with some notable differences. Most importantly, rather than acquiring 1-D spectra, here we need to map a 3-D volume corresponding to a certain part of the human body. This implies that a huge mass of data needs to be sampled and subsequently reconstructed. Thus, speed becomes a crucial factor; computationally intensive methods which are acceptable choices for 1-D reconstruction are rendered useless in the 3-D case. In an attempt to reduce acquisition time and speed up computations, traditional methods employ frequency sampling on coarser grids, or only at low frequencies; the results are often blurred or aliased. Here we shall follow the Compressed Sensing paradigm and sample Fourier space at random, with subsequent reconstruction via ℓ1 minimization.
In detail, we treat the 3-D volume as an array of 2-D slices. Let x_j denote one such slice, column-stacked into a vector of length n. Just as we did in the 1-D case, at the sampling stage we collect data y_j = Φ x_j, where Φ is a d × n matrix drawn from the partial Fourier ensemble. In words, we sample d frequencies of x_j at random. Note that in practice, this frequency sampling is done physically by the MRI scanner; this operation is simulated here with a multiplication by Φ. At the reconstruction stage, we solve (CS1), with the wavelet operator W replaced by its 2-D counterpart. We do so for each slice x_j, until the entire 3-D volume is reconstructed.
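The slice-by-slice structure of this procedure is simple; the sketch below (our own illustration, with synthetic data and an arbitrary 50% sampling rate) shows random frequency sampling applied slice by slice, with a zero-filled inverse FFT standing in where the ℓ1 reconstruction of (CS1) would actually be invoked.

```python
import numpy as np

rng = np.random.default_rng(3)
volume = rng.standard_normal((230, 128, 128))    # stand-in for the 3-D MRI data
keep = rng.random((128, 128)) < 0.5              # random frequency mask (assumed rate)

recon = np.empty_like(volume)
for j, slice_j in enumerate(volume):
    spectrum = np.fft.fft2(slice_j)              # "sampling" is done by the scanner in practice
    y_j = np.where(keep, spectrum, 0.0)          # keep only the sampled frequencies
    recon[j] = np.fft.ifft2(y_j).real            # placeholder for the l1 reconstruction step
```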
In our simulations, we used real data provided to us courtesy of Neal Bangerter of the MRSRL Lab at Stanford University; see [1] for more details. It is a 3-D volume of a part of a human leg, containing 230 slices of dimensions 128 × 128. Figure 16, panels (a), (d) and (g), show X- and Y-axis projections of the data along with one vertical slice. Clearly, computing the full Homotopy solution would be prohibitively time-consuming, considering the data size. Hence, to approximately solve (CS1) at the reconstruction stage, we performed only 800 iterations of Lars. For comparative purposes, we also performed 800 iterations of Omp. The results are depicted in the two rightmost columns of Figure 16. Indeed, we see that 800 iterations were sufficient to obtain a visually faithful result. In particular, all the major arteries appearing in the original data are also present in the reconstructed data. Comparison of the reconstructions obtained by Lars and Omp reveals that the reconstruction of Omp tends to suffer from streaking artifacts, hindering the readability of the final result. Thus, qualitatively, Lars produced a better reconstruction.
10.3 Component Separation in a Noisy Environment
In Section 2, we commented that Homotopy may be easily adapted to treat noisy data, by
terminating at a prescribed point on the Homotopy path. We now illustrate the use of this
Figure 16: CS reconstruction of 3-D MRI data: Left column: X- and Y-axis projections of the
original data, and one vertical slice; Middle column: corresponding reconstructions with Lars;
Right column: corresponding reconstructions with Omp.
strategy in the context of component separation. In the particular example we present here, we consider the problem of separating a signal contaminated with white noise into its harmonic and impulsive components. This relatively simple problem, explored in [9, 20], inspired more ambitious applications such as texture-geometry separation [48] and astronomical image representation [49].

In detail, let y_0 be a signal of length d, composed of a few sinusoids and a few spikes, with a total of k such components, k ≪ d. Note that y_0 may be written as y_0 = A α_0, with A = [I F] a d × 2d matrix representing an overcomplete dictionary composed of the d × d identity matrix I and the d × d Fourier matrix F. α_0 is then a 2d coefficient vector, with ‖α_0‖_0 = k. In our noisy setting, y_0 is not directly observable for measurement. Instead, we measure data y = y_0 + z, with the noise power ‖z‖_2 ≤ ε_n. To solve this underdetermined system for the time and frequency components, we apply the Homotopy algorithm, stopping when the residual drops below the noise level ε_n.
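A sketch of this setup (our own illustration; a real orthogonal transform stands in for the Fourier matrix F so that all quantities stay real, and the amplitudes are chosen arbitrarily) shows the dictionary construction, the noise calibration, and the noise-level stopping rule.

```python
import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(4)
d = 512
F = idct(np.eye(d), norm="ortho", axis=0)            # orthogonal "frequency" basis (stand-in for F)
A = np.hstack([np.eye(d), F])                        # d x 2d dictionary [I  F]

alpha0 = np.zeros(2 * d)
alpha0[rng.choice(d, size=40, replace=False)] = rng.standard_normal(40)        # 40 spikes
alpha0[d + rng.choice(d, size=2, replace=False)] = rng.standard_normal(2)      # 2 harmonic terms
y0 = A @ alpha0
z = rng.standard_normal(d)
z *= np.linalg.norm(y0) / (np.sqrt(3) * np.linalg.norm(z))   # noise power = signal power / 3
y = y0 + z
eps_n = np.linalg.norm(z)

def stopped(residual, eps_n=eps_n):
    """Noise-aware stopping rule: halt the path/greedy iteration once the
    residual has dropped to the noise level, as described in the text."""
    return np.linalg.norm(residual) <= eps_n
```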
Figure 17 presents the results of this simulation. Panel (a) displays a signal y_0 of length d = 512, composed of 2 harmonic terms superposed with 40 spikes, with amplitudes distributed normally. These are plotted in panels (b) and (c). Panel (d) shows the observed data, buried in white Gaussian noise, with a signal-to-noise ratio of 3. Thus, the noise power is one third of the signal power. Panels (e)-(h) on the bottom row of Figure 17 show the corresponding output of the Homotopy algorithm, applied to the data y. As evident from the figure, Homotopy manages to recover most of the components of α_0, leaving mostly noise in the residual.
Figure 17: Time-Frequency Separation in the presence of noise: Top row: The original signal, its
time and frequency components, and the noisy signal with SNR = 3. Bottom row: Corresponding
output of Homotopy.
The relative reconstruction error is ‖α̂ − α_0‖_2 / ‖α_0‖_2 = 0.19. In summary, as this example illustrates, Homotopy may be straightforwardly adapted to deal with noisy data, to successfully recover sparse solutions of underdetermined systems.
11 Accompanying Software
SparseLab is a library of Matlab functions we make available on the web at http://sparselab.stanford.edu. It has been developed in the spirit of reproducible research [4, 10], namely the idea that publication of computational research demands also the publication of the complete software environment needed for reproducing that research. SparseLab has been used by the authors to create the figures and tables used in this article, and the toolbox contains scripts which will reproduce all the calculations of this paper. The current release includes about 400 Matlab files, datasets, and demonstration scripts.
12 On Approximate ℓ1 Minimization Algorithms

In the belief that true ℓ1 minimization is far too slow for large-scale applications, numerous authors have explored approximate algorithms based on iterative thresholding or similar ideas; examples include [23, 11, 48, 49, 25]; related ideas include projection onto convex sets [6]. The reader should keep in mind that there are applications where such methods indeed operate faster than the methods discussed here, while still producing a faithful approximate solution; see, for example, Section 9 of [23]. On the other hand, the present methods for ℓ1 minimization are much faster than general-purpose optimizers, so much of the apparent advantage claimed for approximate ℓ1 schemes has been neutralized by the algorithms studied here. Furthermore, approximate methods inevitably trade off aptitude for speed; in general, the range of sparsities in which such methods reliably recover the sparsest solution is considerably limited compared to ℓ1 minimization. A further advantage of the Homotopy viewpoint is the stimulus it provides for future research: the empirical phenomena identified here may lead to deep research in polytope theory, perhaps comparable to the kinds of results in [22].
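For reference, a generic iterative soft-thresholding scheme of the kind alluded to above can be sketched in a few lines (our own illustration of the general idea, not the algorithm of any particular cited paper); it targets the penalized problem min 0.5‖y − Ax‖_2^2 + λ‖x‖_1.

```python
import numpy as np

def soft(u, t):
    """Soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Generic iterative soft-thresholding for min 0.5*||y - A x||^2 + lam*||x||_1.

    Step size 1/L, with L an upper bound on ||A^T A||_2, guarantees convergence.
    """
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft(x + A.T @ (y - A @ x) / L, lam / L)
    return x
```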
13 Conclusions
In this paper, we study the Homotopy algorithm, introduced by Osborne et al. [40] and Efron et al. [24]. It was originally proposed in the context of overdetermined noisy systems. We propose here to use Homotopy to obtain sparse solutions of underdetermined systems of linear equations. Homotopy follows a piecewise linear solution path, starting from the zero solution and terminating at the solution of (P1). It maintains an active set of column vectors of the matrix A, adding or removing a vector at each step, and updating the solution accordingly. It is a particularly effective algorithm for solving ℓ1 minimization problems when the underlying solution is sparse, and the underlying system is incoherent or random. In such scenarios, it significantly outperforms state-of-the-art general LP solvers. Moreover, its computational complexity is then comparable to greedy stepwise algorithms such as Omp. In fact, the similarity between the Homotopy, Lars, and Omp algorithms illuminates the parallelism evident in results on the performance of ℓ1 minimization and Omp in recovering sparse solutions to underdetermined linear systems.
We made such claims concrete by formally showing that, for two matrix classes, deterministic incoherent matrices and random uniform spherical matrices, Homotopy has the following k-step solution property: if the data admit sparse representation using k columns of A, Homotopy finds it in exactly k steps. We proved that for problems drawn from the suite S(Inc_μ ...

References

[15] David L. Donoho. For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Comm. Pure Appl. Math., 59(7):907–934, July 2006.
[16] David L. Donoho. For most underdetermined systems of linear equations, the minimal ℓ1-norm near-solution approximates the sparsest near-solution. Comm. Pure Appl. Math., 59(6):797–829, June 2006.

[17] David L. Donoho. High-dimensional centrosymmetric polytopes with neighborliness proportional to dimension. Discrete and Computational Geometry, 35(4):617–652, May 2006.

[18] David L. Donoho and Michael Elad. Optimally sparse representation from overcomplete dictionaries via ℓ1 norm minimization. Proc. Natl. Acad. Sci. USA, 100(5):2197–2202, March 2003.

[19] David L. Donoho, Michael Elad, and Vladimir N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, January 2006.

[20] David L. Donoho and Xiaoming Huo. Uncertainty principles and ideal atomic decomposition. IEEE Trans. Info. Theory, 47(7):2845–2862, 2001.

[21] David L. Donoho and Benjamin F. Logan. Signal recovery and the large sieve. SIAM J. Appl. Math., 52(2):577–591, 1992.

[22] David L. Donoho and Jared Tanner. Counting faces of randomly-projected polytopes when the projection radically lowers dimension, 2006.

[23] David L. Donoho, Yaakov Tsaig, Iddo Drori, and Jean-Luc Starck. Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit. Submitted, 2006.

[24] Bradley Efron, Trevor Hastie, Iain M. Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

[25] M. Elad, B. Matalon, and M. Zibulevsky. Coordinate and subspace optimization methods for linear least squares with non-quadratic regularization. Appl. Comp. Harm. Anal., to appear.

[26] Michael Elad and Alfred M. Bruckstein. A generalized uncertainty principle and sparse representations in pairs of bases. IEEE Transactions on Information Theory, 48(9):2558–2567, September 2002.

[27] Jean J. Fuchs. On sparse representations in arbitrary redundant bases. IEEE Transactions on Information Theory, 50(6):1341–1344, June 2004.
[28] A. Y. Garnaev and E. D. Gluskin. On widths of the Euclidean ball. Soviet Mathematics Doklady, 30:200–203, 1984. (In English.)

[29] Anna C. Gilbert, S. Guha, P. Indyk, S. Muthukrishnan, and M. Strauss. Near-optimal sparse Fourier representations via sampling. In Proc. 34th ACM Symposium on Theory of Computing, pages 152–161. ACM Press, 2002.

[30] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.

[31] Remi Gribonval and Morten Nielsen. Sparse representations in unions of bases. IEEE Trans. Info. Theory, 49(12):1320–1325, 2003.

[32] Remi Gribonval and Morten Nielsen. Highly sparse representations from dictionaries are unique and independent of the sparseness measure. Preprint, 2006.

[33] M. Harwit and N. J. A. Sloane. Hadamard Transform Optics. Academic Press, 1979.

[34] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.

[35] Jeffrey C. Hoch and Alan Stern. NMR Data Processing. Wiley-Liss, 1996.

[36] Boris S. Kashin. Diameters of certain finite-dimensional sets in classes of smooth functions. Izv. Akad. Nauk SSSR, Ser. Mat., 41(2):334–351, 1977.

[37] Michael Lustig, Jin H. Lee, David L. Donoho, and John M. Pauly. Faster imaging with randomly perturbed spirals and ℓ1 reconstruction. In Proc. of the 13th ISMRM, 2004.

[38] Michael Lustig, J. M. Santos, David L. Donoho, and John M. Pauly. Rapid MR angiography with randomly under-sampled 3DFT trajectories and non-linear reconstruction. In Proc. of the 9th SCMR, 2006.

[39] Dmitry M. Malioutov, Müjdat Çetin, and Alan S. Willsky. Homotopy continuation for sparse signal representation. In IEEE Int. Conf. Acoustics, Speech and Signal Processing, Philadelphia, PA, volume 5, pages 733–736, March 2005.

[40] Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. A new approach to variable selection in least squares problems. IMA J. Numerical Analysis, 20:389–403, 2000.

[41] Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9:319–337, 2000.

[42] M. D. Plumbley. Geometry and homotopy for ℓ1 sparse representations. In Proceedings of SPARS'05, Rennes, France, pages 206–213, November 2005.

[43] M. D. Plumbley. Recovery of sparse representations by polytope faces pursuit. In Proceedings of the 6th International Conference on Independent Component Analysis and Blind Source Separation (ICA 2006), Charleston, SC, USA, pages 206–213, March 2006.

[44] Mark Rudelson and Roman Vershynin. Geometric approach to error-correcting codes and reconstruction of signals. Technical report, Department of Mathematics, Univ. of California at Davis, 2005.
[45] Michael A. Saunders and Byunggyoo Kim. PDCO: Primal-dual interior method for convex objectives. http://www.stanford.edu/group/SOL/software/pdco.html.

[46] Peter Schmieder, Alan S. Stern, Gerhard Wagner, and Jeffrey C. Hoch. Application of nonlinear sampling schemes to COSY-type spectra. Journal of Biomolecular NMR, 3(134):569–576, 1993.

[47] Peter Schmieder, Alan S. Stern, Gerhard Wagner, and Jeffrey C. Hoch. Improved resolution in triple-resonance spectra by nonlinear sampling in the constant-time domain. Journal of Biomolecular NMR, 4(188):483–490, 1994.

[48] Jean-Luc Starck, Michael Elad, and David L. Donoho. Image decomposition: Separation of texture from piecewise smooth content. In SPIE Annual Meeting, San Diego, California, USA, December 2003.

[49] Jean-Luc Starck, Michael Elad, and David L. Donoho. Redundant multiscale transforms and their application for morphological component analysis. Advances in Imaging and Electron Physics, 132:287–348, 2004.

[50] Thomas Strohmer and Robert Heath Jr. Grassmannian frames with applications to coding and communications. Appl. Comp. Harm. Anal., 14(3):257–275, 2003.

[51] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58(1):267–288, 1996.

[52] Joel A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(11):2231–2242, October 2004.

[53] Joel A. Tropp. Just relax: Convex programming methods for subset selection and sparse approximation. IEEE Transactions on Information Theory, 51(3):1030–1051, March 2006.

[54] Joel A. Tropp and Anna C. Gilbert. Signal recovery from partial information by orthogonal matching pursuit. Manuscript, 2005.

[55] Yaakov Tsaig and David L. Donoho. Breakdown of equivalence between the minimal ℓ1-norm solution and the sparsest solution. Signal Processing, 86(3):533–548, 2006.

[56] Yaakov Tsaig and David L. Donoho. Extensions of compressed sensing. Signal Processing, 86(3):549–571, 2006.