
Fast Solution of ℓ1-norm Minimization Problems
When the Solution May be Sparse

David L. Donoho and Yaakov Tsaig

October 2006
Abstract
The minimum ℓ1-norm solution to an underdetermined system of linear equations y = Ax is often, remarkably, also the sparsest solution to that system. This sparsity-seeking property is of interest in signal processing and information transmission. However, general-purpose optimizers are much too slow for ℓ1 minimization in many large-scale applications.

The Homotopy method was originally proposed by Osborne et al. for solving noisy overdetermined ℓ1-penalized least squares problems. We here apply it to solve the noiseless underdetermined ℓ1-minimization problem min ||x||_1 subject to y = Ax. We show that Homotopy runs much more rapidly than general-purpose LP solvers when sufficient sparsity is present. Indeed, the method often has the following k-step solution property: if the underlying solution has only k nonzeros, the Homotopy method reaches that solution in only k iterative steps. When this property holds and k is small compared to the problem size, this means that ℓ1 minimization problems with k-sparse solutions can be solved in a fraction of the cost of solving one full-sized linear system.

We demonstrate this k-step solution property for two kinds of problem suites. First, incoherent matrices A, where off-diagonal entries of the Gram matrix A^T A are all smaller than M. If y is a linear combination of at most k ≤ (M^{-1} + 1)/2 columns of A, we show that Homotopy has the k-step solution property. Second, ensembles of d × n random matrices A. If A has iid Gaussian entries, then, when y is a linear combination of at most k < d/(2 log(n)) · (1 − ε_n) columns, with ε_n > 0 small, Homotopy again exhibits the k-step solution property with high probability. Further, we give evidence showing that for ensembles of d × n partial orthogonal matrices, including partial Fourier matrices and partial Hadamard matrices, with high probability, the k-step solution property holds up to a dramatically higher threshold k, satisfying k/d < ρ̂(d/n), for a certain empirically-determined function ρ̂(·).

Our results imply that Homotopy can efficiently solve some very ambitious large-scale problems arising in stylized applications of error-correcting codes, magnetic resonance imaging, and NMR spectroscopy. Our approach also sheds light on the evident parallelism in results on ℓ1 minimization and Orthogonal Matching Pursuit (OMP), and aids in explaining the inherent relations between Homotopy, LARS, OMP, and Polytope Faces Pursuit.
Key Words and Phrases: LASSO. LARS. Homotopy Methods. Basis Pursuit. ℓ1 minimization. Orthogonal Matching Pursuit. Polytope Faces Pursuit. Sparse Representations. Underdetermined Systems of Linear Equations.
Acknowledgements. This work was supported in part by grants from NIH, ONR-MURI,
and NSF DMS 00-77261, DMS 01-40698 (FRG) and DMS 05-05303. We thank Michael Lustig
and Neal Bangerter of the Magnetic Resonance Systems Research Lab (MRSRL) at Stanford
University for supplying us with 3-D MRI data for use in our examples. YT would like to thank
Michael Saunders for helpful discussions and references.

Department of Statistics, Stanford University, Stanford CA, 94305

Institute for Computational and Mathematical Engineering, Stanford University, Stanford CA, 94035
1 Introduction
Recently, ℓ1-norm minimization has attracted attention in the signal processing community [9, 26, 20, 7, 5, 6, 8, 27, 31, 32, 44] as an effective technique for solving underdetermined systems of linear equations, of the following form. A measurement vector y ∈ R^d is generated from an unknown signal of interest x_0 ∈ R^n by a linear transformation y = Ax_0, with a known matrix A of size d × n. An important factor in these applications is that d < n, so the system of equations y = Ax is underdetermined. We solve the convex optimization problem

    (P_1)    min_x ||x||_1 subject to y = Ax,

obtaining a vector x_1 which we consider an approximation to x_0.
Applications of (P_1) have been proposed in the context of time-frequency representation [9], overcomplete signal representation [20, 26, 27, 31, 32], texture/geometry separation [48, 49], compressed sensing (CS) [14, 7, 56], rapid MR imaging [37, 38], removal of impulsive noise [21], and decoding of error-correcting codes (ECC) [5, 44]. In such applications, the underlying problem is to obtain a solution to y = Ax which is as sparse as possible, in the sense of having few nonzero entries. The above-cited literature shows that often, when the desired solution x_0 is sufficiently sparse, (P_1) delivers either x_0 or a reasonable approximation.
Traditionally, the problem of finding sparse solutions has been catalogued as belonging to a class of combinatorial optimization problems, whereas (P_1) can be solved by linear programming, and is thus dramatically more tractable in general. Nevertheless, general-purpose LP solvers ultimately involve solution of full n × n linear systems, and require many applications of such solvers, each application costing order O(n^3) flops.
Many of the interesting problems where (P_1) has been contemplated are very large in scale. For example, already 10 years ago, Basis Pursuit [9] tackled problems with d = 8,192 and n = 262,144, and problem sizes approached currently [55, 7, 56, 38] are even larger. Such large-scale problems are simply too large for general-purpose strategies to be used in routine processing.
1.1 Greedy Approaches
Many of the applications of (P_1) can instead be attacked heuristically by fitting sparse models, using greedy stepwise least squares. This approach is often called Matching Pursuit or Orthogonal Matching Pursuit (Omp) [12] in the signal processing literature. Rather than minimizing an objective function, Omp constructs a sparse solution to a given problem by iteratively building up an approximation; the vector y is approximated as a linear combination of a few columns of A, where the active set of columns to be used is built column by column, in a greedy fashion. At each iteration, a new column is added to the active set: the column that best correlates with the current residual.
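To make the greedy step concrete, here is a minimal sketch of Omp in Python/NumPy. It is our own illustrative code, not the implementation studied later in this paper; the function name, the fixed step budget k, and the tolerance are assumptions made for the example.

```python
import numpy as np

def omp(A, y, k, tol=1e-12):
    """Minimal Orthogonal Matching Pursuit sketch: greedily pick up to k columns of A."""
    d, n = A.shape
    residual = y.copy()
    active = []                                  # indices of selected columns
    x = np.zeros(n)
    for _ in range(k):
        # select the column best correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in active:
            active.append(j)
        # least-squares re-fit on the active set (the "orthogonal" step)
        coef, *_ = np.linalg.lstsq(A[:, active], y, rcond=None)
        x = np.zeros(n)
        x[active] = coef
        residual = y - A @ x
        if np.linalg.norm(residual) < tol:       # y fully explained by active columns
            break
    return x, active
```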
Although Omp is a heuristic method, in some cases it works marvelously. In particular, there are examples where the data y admit sparse synthesis using only k columns of A, and greedy selection finds exactly those columns in just k steps. Perhaps surprisingly, optimists can cite theoretical work supporting this notion; work by Tropp et al. [52, 54] and Donoho et al. [19] has shown that Omp can, in certain circumstances, succeed at finding the sparsest solution. Yet, theory provides comfort for pessimists too; Omp fails to find the sparsest solution in certain scenarios where ℓ1 minimization succeeds [9, 15, 52].
Optimists can also rely on empirical evidence to support their hopeful attitude. By now several research groups have conducted studies of Omp on problems where A is random with iid Gaussian elements and x_0 is sparse but random (we are aware of experiments by Tropp, by Petukhov, as well as our own), all with the conclusion that Omp, despite its greedy and heuristic nature, performs roughly as well as (P_1) in certain problem settings. Later in this paper we document this in detail. However, to our knowledge, there is no theoretical basis supporting these empirical observations.
This has the air of mystery: why do (P_1) and Omp behave so similarly, when one has a rigorous foundation and the other one apparently does not? Is there some deeper connection or similarity between the two ideas? If there is a similarity, why is Omp so fast while ℓ1 minimization is so slow?

In this paper we aim to shed light on all these questions.
1.2 A Continuation Algorithm To Solve (P_1)
In parallel with developments in the signal processing literature, there has also been interest in the statistical community in fitting regression models while imposing ℓ1-norm constraints on the regression coefficients. Tibshirani [51] proposed the so-called Lasso problem, which we state using our notation as follows:

    (L_q)    min_x ||y − Ax||_2^2 subject to ||x||_1 ≤ q;

in words: a least-squares fit subject to an ℓ1-norm constraint on the coefficients. In Tibshirani's original proposal, A_{d×n} was assumed to have d > n, i.e. representing an overdetermined linear system.
It is convenient to consider instead the unconstrained optimization problem

    (D_λ)    min_x ||y − Ax||_2^2 / 2 + λ ||x||_1,

i.e. a form of ℓ1-penalized least-squares. Indeed, problems (L_q) and (D_λ) are equivalent under an appropriate correspondence of parameters. To see that, associate with each problem (D_λ), λ ∈ [0, ∞), a solution x_λ (for simplicity assumed unique). The set {x_λ : λ ∈ [0, ∞)} identifies a solution path, with x_λ = 0 for λ large and, as λ → 0, x_λ converging to the solution of (P_1). Similarly, {x_q : q ∈ [0, ∞)} traces out a solution path for problem (L_q), with x_q = 0 for q = 0 and, as q increases, x_q converging to the solution of (P_1). Thus, there is a reparametrization q(λ) defined by q(λ) = ||x_λ||_1, so that the solution paths of (D_λ) and (L_{q(λ)}) coincide.
In the classic overdetermined setting, d > n, Osborne, Presnell and Turlach [40, 41] made the useful observation that the solution path of (D_λ), λ ≥ 0 (or (L_q), q ≥ 0) is polygonal. Further, they characterized the changes in the solution x_λ at vertices of the polygonal path. Vertices on this solution path correspond to subset models, i.e. vectors having nonzero elements only on a subset of the potential candidate entries. As we move along the solution path, the subset is piecewise constant as a function of λ, changing only at critical values of λ, corresponding to the vertices on the polygonal path. This evolving subset is called the active set. Based on these observations, they presented the Homotopy algorithm, which follows the solution path by jumping from vertex to vertex of this polygonal path. It starts at x_λ = 0 for λ large, with an empty active set. At each vertex, the active set is updated through the addition and removal of variables. Thus, in a sequence of steps, it successively obtains the solutions x_λ at a special problem-dependent sequence {λ_ℓ} associated to vertices of the polygonal path. The name Homotopy refers to the fact that the objective function for (D_λ) undergoes a homotopy from the ℓ2 constraint to the ℓ1 objective as λ decreases.
Malioutov et al. [39] were the first to apply the Homotopy method to the formulation (D_λ) in the underdetermined setting, when the data are noisy. In this paper, we also suggest using the Homotopy method for solving (P_1) in the underdetermined setting.

Homotopy Algorithm for Solving (P_1): Apply the Homotopy method: follow the solution path from x_{λ_0} = 0, with λ_0 large, down to the solution at λ = 0. Upon reaching the λ = 0 limit, (P_1) is solved.
Traditionally, to solve (P_1), one would apply the simplex algorithm or an interior-point method, which, in general, starts out with a dense solution and converges to the solution of (P_1) through a sequence of iterations, each requiring the solution of a full linear system. In contrast, the Homotopy method starts out at x_{λ_0} = 0, and successively builds a sparse solution by adding or removing elements from its active set. Clearly, in a sparse setting, this latter approach is much more favorable, since, as long as the solution has few nonzeros, Homotopy will reach the solution in a few steps.
Numerically, each step of the algorithm involves the rank-one update of a linear system, and so if the whole procedure stops in k steps, yielding a solution with k nonzeros, its overall complexity is bounded by k^3 + kdn flops. For k ≪ d and d ≪ n, this is far better than the d^3/3 flops it would take to solve just one d × d linear system. Moreover, to solve (D_λ) for all λ ≥ 0 by a traditional approach, one would need to repeatedly solve a quadratic program for every value of λ of interest. For any problem size beyond the very smallest, that would be prohibitively time consuming. In contrast, Homotopy delivers all the solutions x_λ to (D_λ), λ ≥ 0.
1.3 LARS
Efron, Hastie, Johnstone, and Tibshirani [24] developed an approximation to the Homotopy
algorithm which is quite instructive. The Homotopy algorithm maintains an active set of
nonzero variables composing the current solution. When moving to a new vertex of the solution
polygon, the algorithm may either add new elements to or remove existing elements from the
active set. The Lars procedure is obtained by following the same sequence of steps, only
omitting the step that considers removal of variables from the active set, thus constraining its
behavior to adding new elements to the current approximation. In other words, once activated,
a variable is never removed.
In modifying the stepwise rule, of course, one implicitly obtains a new polygonal path, the Lars path, which in general may be different from the Lasso path. Yet, Efron et al. observed that in practice, the Lars path is often identical to the Lasso path. This equality is very interesting in the present context, because Lars is so similar to Omp. Both algorithms build up a model a step at a time, adding a new variable to the active set at each step, and ensuring that the new variable is in some sense the most important among the potential candidate variables. The details of determining importance differ, but in both cases involve the inner product between the candidate new variables and the current residual.

In short, a stepwise algorithm with a greedy flavor can sometimes produce the same result as full-blown ℓ1 minimization. This suggests a possibility which can be stated in two different ways:

... that an algorithm for quasi-ℓ1 minimization runs just as rapidly as Omp.

... that an algorithm visibly very similar to Omp can be just as effective as ℓ1 minimization.
1.4 Properties of Homotopy
Another possibility comes to mind. The Homotopy algorithm is, algorithmically speaking, a variant of Lars, differing only in a few lines of code. And yet this small variation makes Homotopy rigorously able to solve a global optimization problem.

More explicitly, the difference between Homotopy and Lars lies in the provision by Homotopy for terms to leave, as well as enter, the active set. This means that the number of steps required by Homotopy can, in principle, be significantly greater than the number of steps required by Lars, as terms enter and leave the active set numerous times. If so, we observe model churning which causes Homotopy to run slowly. We present evidence that under favorable conditions, such churning does not occur. In such cases, the Homotopy algorithm is roughly as fast as both Lars and Omp.
In this paper, we consider two settings for performance measurement: deterministic incoherent matrices and random matrices. We demonstrate that, when a k-sparse representation exists, k ≪ d, the Homotopy algorithm finds it in k steps. Results in each setting parallel existing results about Omp in that setting. Moreover, the discussion above indicates that each step of Homotopy is identical to a step of Lars and therefore very similar to a corresponding step of Omp. We interpret this parallelism as follows.
The Homotopy algorithm, in addition to solving the ℓ1 minimization problem, runs just as rapidly as the heuristic algorithms Omp/Lars, in those problem suites where those algorithms have been shown to correctly solve the sparse representation problem.
Since there exists a wide range of cases where ℓ1 minimization is known to find the sparsest solution while Omp is known to fail [9, 15, 52], Homotopy gives in some sense the best of both worlds by a single algorithm: speed where possible, sparsity where possible.
In addition, our viewpoint sheds new perspective on previously-observed similarities between results about ℓ1 minimization and about Omp. Previous theoretical work [19, 52, 54] found that these two seemingly disparate methods had comparable theoretical results for finding sparse solutions. Our work exhibits inherent similarities between the two methods which motivate these parallel findings.

Figure 1 summarizes the relation between ℓ1 minimization and Omp, by highlighting the intimate bonds between Homotopy, Lars, and Omp. We elaborate on these inherent connections in a later section.
1.5 Main Results
Our results about the stopping behavior of Homotopy require some terminology.

Definition 1 A problem suite S(E, V; d, n, k) is a collection of problems defined by three components: (a) an ensemble E of matrices A of size d × n; (b) an ensemble V of n-vectors α_0 with at most k nonzeros; (c) an induced ensemble of left-hand sides y = Aα_0.

Where the ensemble V is omitted from the definition of a problem suite, we interpret that as saying that the relevant statements hold for any nonzero distribution. Otherwise, V will take one of three values: (Uniform), when the nonzeros are iid uniform on [0, 1]; (Gauss), when the nonzeros are iid standard normal; and (Bernoulli), when the nonzeros are iid equi-probable Bernoulli distributed with values ±1.
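As a concrete illustration of this terminology, here is one way a problem instance from S(USE, V; d, n, k) might be generated in Python/NumPy; the Uniform Spherical Ensemble (USE) is defined in Section 1.5.2 below, and the helper function and its defaults are our own assumptions, not code from the paper.

```python
import numpy as np

def draw_problem_instance(d, n, k, nonzero_ensemble="Uniform", seed=None):
    """Draw (A, y, alpha0) from the suite S(USE, V; d, n, k) as defined in the text."""
    rng = np.random.default_rng(seed)
    # Uniform Spherical Ensemble: columns iid uniform on the sphere S^{d-1}
    A = rng.standard_normal((d, n))
    A /= np.linalg.norm(A, axis=0)
    # k-sparse coefficient vector alpha0, nonzeros drawn from ensemble V
    alpha0 = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    if nonzero_ensemble == "Uniform":
        alpha0[support] = rng.uniform(0.0, 1.0, size=k)
    elif nonzero_ensemble == "Gauss":
        alpha0[support] = rng.standard_normal(k)
    elif nonzero_ensemble == "Bernoulli":
        alpha0[support] = rng.choice([-1.0, 1.0], size=k)
    y = A @ alpha0                     # induced left-hand side
    return A, y, alpha0
```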
[Figure 1 diagram: ℓ1 minimization, Homotopy, Lars, and Omp, linked by the relations (1), (2), (3) described in the caption.]

Figure 1: Bridging ℓ1 minimization and OMP. (1) Homotopy provably solves ℓ1 minimization problems [24]. (2) Lars is obtained from Homotopy by removing the sign constraint check. (3) Omp and Lars are similar in structure, the only difference being that Omp solves a least-squares problem at each iteration, whereas Lars solves a linearly-penalized least-squares problem. There are initially surprising similarities of results concerning the ability of OMP and ℓ1 minimization to find sparse solutions. We view those results as statements about the k-step solution property. We present evidence showing that for sufficiently small k, steps (2) and (3) do not affect the threshold for the k-sparse solution property. Since both algorithms have the k-step solution property under the given conditions, they both have the sparse solution property under those conditions.
Definition 2 An algorithm 𝒜 has the k-step solution property at a given problem instance (A, y) drawn from a problem suite S(E, V; d, n, k) if, when applied to the data (A, y), the algorithm terminates after at most k steps with the correct solution.

Our main results establish the k-step solution property in specific problem suites.
1.5.1 Incoherent Systems
The mutual coherence M(A) of a matrix A whose columns are normalized to unit length is the maximal off-diagonal entry of the Gram matrix A^T A. We call the collection of matrices A with M(A) ≤ μ the incoherent ensemble with coherence bound μ (denoted Inc_μ). Let S(Inc_μ; d, n, k) be the suite of problems with d × n matrices drawn from the incoherent ensemble Inc_μ, with vectors α_0 having ||α_0||_0 ≤ k. For the incoherent problem suite, we have the following result:

Theorem 1 Let (A, y) be a problem instance drawn from S(Inc_μ; d, n, k). Suppose that

    k ≤ (μ^{-1} + 1)/2.    (1.1)

Then, the Homotopy algorithm runs k steps and stops, delivering the solution α_0.
The condition (1.1) has appeared elsewhere; work by Donoho et al. [20, 18], Gribonval and Nielsen [31], and Tropp [52] established that when (1.1) holds, the sparsest solution is unique and equal to α_0, and both ℓ1 minimization and Omp recover α_0. In particular, Omp takes at most k steps to reach the solution.

In short, we show here that for the general class of problems S(Inc_μ; d, n, k), where a unique sparsest solution is known to exist, and where Omp finds it in k steps, Homotopy finds that same solution in the same number of steps. Note that Homotopy always solves the ℓ1 minimization problem; the result shows that it operates particularly rapidly over the sparse incoherent suite.
1.5.2 Random Matrices
We first consider d × n random matrices A from the Uniform Spherical Ensemble (USE). Such matrices have columns independently and uniformly distributed on the sphere S^{d−1}. Consider the suite S(USE; d, n, k) of random problems where A is drawn from the USE and where α_0 has at most k nonzeros at k randomly-chosen sites, with the sites chosen independently of A.

Empirical Finding 1 Fix δ ∈ (0, 1). Let d = d_n = ⌊δn⌋. Let (A, y) be a problem instance drawn from S(USE; d, n, k). There is ε_n > 0 small, so that, with high probability, for

    k ≤ (d_n / (2 log(n))) · (1 − ε_n),    (1.2)

the Homotopy algorithm runs k steps and stops, delivering the solution α_0.
Remarks:

• We conjecture that corresponding to this empirical finding is a theorem, in which the conclusion is that as n → ∞, ε_n → 0.

• The empirical result is in fact stronger. We demonstrate that for problem instances with k emphatically not satisfying (1.2), but rather

    k ≥ (d_n / (2 log(n))) · (1 + ε_n),

then, with high probability for large n, the Homotopy algorithm fails to terminate in k steps. Thus, (1.2) delineates a boundary between two regions in (k, d, n) space; one where the k-step property holds with high probability, and another where the chance of k-step termination is very low.

• This empirical finding also holds for matrices drawn from the Random Signs Ensemble (RSE). Matrices in this ensemble are constrained to have their entries ±1 (suitably normalized to obtain unit-norm columns), with signs drawn independently and with equal probability.
We next consider ensembles made of partial orthogonal transformations. In particular, we consider the following matrix ensembles:

• Partial Fourier Ensemble (PFE). Matrices in this ensemble are obtained by sampling at random d rows of an n by n Fourier matrix.

• Partial Hadamard Ensemble (PHE). Matrices in this ensemble are obtained by sampling at random d rows of an n by n Hadamard matrix.

• Uniform Random Projection Ensemble (URPE). Matrices in this ensemble are obtained by generating an n by n random orthogonal matrix and sampling d of its rows at random.
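For concreteness, the following Python sketch shows one plausible way to draw matrices from these three ensembles; the row-sampling and normalization conventions are our assumptions, since the paper does not prescribe an implementation.

```python
import numpy as np
from scipy.linalg import hadamard, qr

def partial_fourier(d, n, rng):
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)      # n-by-n unitary DFT matrix (complex-valued)
    rows = rng.choice(n, size=d, replace=False)
    return F[rows, :]

def partial_hadamard(d, n, rng):
    H = hadamard(n) / np.sqrt(n)                # n must be a power of 2 here
    rows = rng.choice(n, size=d, replace=False)
    return H[rows, :]

def uniform_random_projection(d, n, rng):
    Q, _ = qr(rng.standard_normal((n, n)))      # random n-by-n orthogonal matrix
    rows = rng.choice(n, size=d, replace=False)
    return Q[rows, :]
```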
As it turns out, when applied to problems drawn from these ensembles, the Homotopy algorithm maintains the k-step property at a substantially greater range of signal sparsities. In detail, we give evidence supporting the following conclusion.
Empirical Finding 2 Fix δ ∈ (0, 1). Let d = d_n = ⌊δn⌋. Let (A, y) be a problem instance drawn from S(E, V; d, n, k), with E ∈ {PFE, PHE, URPE} and V ∈ {Uniform, Gauss, Bernoulli}. The empirical function ρ̂(δ) depicted in Figure 8 marks a transition such that, for k/d_n ≤ ρ̂(δ), the Homotopy algorithm runs k steps and stops, delivering the solution α_0.
A third, and equally important, finding we present involves the behavior of the Homotopy algorithm when the k-step property does not hold. We provide evidence for the following.

Empirical Finding 3 Let (A, y) be a problem instance drawn from S(E, V; d, n, k), with E ∈ {PFE, PHE, URPE}, V ∈ {Uniform, Gauss, Bernoulli}, and (d, n, k) chosen so that the k-step property does not hold. With high probability, Homotopy, when applied to (A, y), operates in either of two modes:

1. It recovers the sparse solution α_0 in c_s · d iterations, with c_s a constant depending on E, V, empirically found to be less than 1.6.

2. It does not recover α_0, returning a solution to (P_1) in c_f · d iterations, with c_f a constant depending on E, V, empirically found to be less than 4.85.

We interpret this finding as saying that we may safely apply Homotopy to problems whose solutions are not known a priori to be sparse, and still obtain a solution rapidly. In addition, if the underlying solution to the problem is indeed sparse, Homotopy is particularly efficient in finding it. This finding is of great usefulness in practical applications of Homotopy.
1.6 An Illustration: Error Correction in a Sparse Noisy Channel
The results presented above have far-reaching implications in practical scenarios. We demonstrate these with a stylized problem inspired by digital communication systems, of correcting gross errors in a digital transmission. Recently, researchers pointed out that ℓ1-minimization/sparsity ideas have a role to play in decoding linear error-correcting codes over Z [5, 44], a problem known, in general, to be NP-hard. Applying the Homotopy algorithm in this scenario, we now demonstrate the following remarkable property, an implication of Empirical Finding 2.

Corollary 1 Consider a communications channel corrupting a fraction ε < 0.24 of every n transmitted integers, the corrupted sites chosen by a Bernoulli(ε) iid selection. For large n, there exists a rate-1/5 code that, with high probability, can be decoded, without error, in at most 0.24n Homotopy steps.

In other words, even when as many as a quarter of the transmitted entries are corrupted, Homotopy operates at peak performance, recovering the encoded bits in a fixed number of steps, known in advance. This feature enables the design of real-time decoders obeying a hard computational budget constraint, with a fixed number of flops per decoding effort.
Let us describe the coding strategy in more detail. Assume θ is a digital signal of length p, with entries ±1, representing bits to be transmitted over a digital communication channel. Prior to transmission, we encode θ with a rate-1/5 code constructed in the following manner. Let H be an n × n Hadamard matrix, n = 5p. Construct the encoding matrix E by selecting p rows of H at random. The decoding matrix D is then composed of the other 4p rows of H.

Figure 2: Performance of Homotopy in decoding a rate-1/5 partial Hadamard code. Each panel shows the number of iterations, divided by n, that Homotopy takes in order to recover the coded signal, versus ε, the fraction of corrupt entries in the transmitted signal, with the noise distributed (a) N(0,1); (b) ±1 with equal probabilities; (c) Cauchy; (d) Rayleigh. In each plot, 'o' indicates k-step recovery, and 'x' implies recovery in more than k steps.
The encoding stage amounts to computing x = E^T θ, with x the encoded signal, of length n. At the decoder, we receive r = x + z, where z has nonzeros in k random positions, and the nonzeros follow a specified distribution. At the decoder, we apply Homotopy to solve

    (EC_1)    min_α ||α||_1 subject to Dα = Dr.

Call the solution α̂. The decoder output is then θ̂ = sgn(E(r − α̂)).

The key property being exploited is the mutual orthogonality of D and E. Specifically, note that Dr = D(E^T θ + z) = Dz. Hence, (EC_1) is essentially solving for the sparse error pattern.

To demonstrate our claim, we set p = 256, and considered error rates k = εn, with ε varying in [0.04, 0.4]. In each case we generated a problem instance as described above, and measured the number of steps Homotopy takes to reach the correct solution. Results are plotted in Figure 2, showing four different noise distributions. Inspecting the results, we make two important observations. First, as claimed, for the noise distributions considered, as long as the fraction ε of corrupt entries is less than 0.24, Homotopy recovers the correct solution in k steps, as implied by Empirical Finding 2. Second, as ε increases, the k-step property may begin to fail, but Homotopy still does not take many more than k steps, in accord with Empirical Finding 3.
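A minimal numerical sketch of this scheme follows. It is our own illustration, not the authors' code: it uses scipy's Hadamard construction, and it solves (EC_1) with a generic LP reformulation via scipy.optimize.linprog rather than with Homotopy (described in Section 2). The small dimensions (p = 8, n = 64, hence a rate-1/8 code) are chosen purely for brevity.

```python
import numpy as np
from scipy.linalg import hadamard
from scipy.optimize import linprog

def encode(theta, E):
    return E.T @ theta                           # x = E^T theta, length n

def decode(r, D, E):
    # Solve (EC_1): min ||alpha||_1 subject to D alpha = D r,
    # via the standard LP reformulation alpha = u - v, u, v >= 0.
    _, n = D.shape
    c = np.ones(2 * n)
    A_eq = np.hstack([D, -D])
    res = linprog(c, A_eq=A_eq, b_eq=D @ r, bounds=(0, None), method="highs")
    alpha = res.x[:n] - res.x[n:]
    return np.sign(E @ (r - alpha))              # recovered bits

# Small demonstration
rng = np.random.default_rng(0)
n, p = 64, 8
H = hadamard(n) / np.sqrt(n)
rows = rng.permutation(n)
E, D = H[rows[:p], :], H[rows[p:], :]            # mutually orthogonal row blocks of H
theta = rng.choice([-1.0, 1.0], size=p)
z = np.zeros(n)
z[rng.choice(n, size=5, replace=False)] = rng.standard_normal(5)   # 5 gross errors
theta_hat = decode(encode(theta, E) + z, D, E)
print(np.array_equal(theta_hat, theta))          # should typically print True
```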
1.7 Relation to Earlier Work
Our results seem interesting in light of previous work by Osborne et al. [40] and Efron et al. [24]. In particular, our proposal for fast solution of ℓ1 minimization amounts simply to applying to (P_1) previously known algorithms (Homotopy, a.k.a. Lars/Lasso) designed for fitting equations to noisy data. We make the following contributions.
• Few Steps. Efron et al. considered the overdetermined noisy setting, and remarked in passing that while, in principle, Homotopy could take many steps, they had encountered examples where it took a direct path to the solution, in which a term, once entered into the active set, stayed in the active set [24]. Our work in the noiseless underdetermined setting formally identifies a precise phenomenon, namely, the k-step solution property, and delineates a range of problem suites where this phenomenon occurs.

• Similarity of Homotopy and Lars. Efron et al., in the overdetermined noisy setting, commented that, while in principle the solution paths of Homotopy and Lars could be different, in practice they came across examples where Homotopy and Lars yielded very similar results [24]. Our work in the noiseless case formally defines a property which implies that Homotopy and Lars have the same solution path, and delineates a range of problem suites where this property holds. In addition, we present simulation studies showing that, over a region in parameter space where ℓ1 minimization recovers the sparsest solution, Lars also recovers the sparsest solution.

• Similarity of Homotopy and Omp. In the random setting, a result of Tropp and Gilbert [54] can be read as saying that Omp recovers the sparsest solution under a condition similar to (1.2), although with a somewhat worse constant term; their result (and proof) can be interpreted in the light of our work as saying that Omp actually has the k-step solution property under their condition. In fact, it has the k-step solution property under the condition (1.2). Below, we elaborate on the connections between Homotopy and Omp leading to these similar results.

• Similarity of Homotopy and Polytope Faces Pursuit. Recently, Plumbley introduced a greedy algorithm to solve (P_1) in the dual space, which he named Polytope Faces Pursuit (Pfp) [43, 42]. The algorithm bears strong resemblance to Homotopy and Omp. Below, we elaborate on Plumbley's work, to show that the affinities are not coincidental, and in fact, under certain conditions, Pfp is equivalent to Homotopy. We then conclude that Pfp maintains the k-step solution property under the same conditions required for Homotopy.

In short, we provide theoretical underpinnings, formal structure, and empirical findings. We also believe our viewpoint clarifies the connections between Homotopy, Lars, and Pfp.
1.8 Contents
The paper is organized as follows. Section 2 reviews the Homotopy algorithm in detail. Section 3 presents running-time simulations alongside a formal complexity analysis of the algorithm, and discusses evidence leading to Empirical Finding 3. In Section 4 we prove Theorem 1, the k-step solution property for the sparse incoherent problem suite. In Section 5 we demonstrate Empirical Finding 1, the k-step solution property for the sparse USE problem suite. Section 6 follows, with evidence to support Empirical Finding 2. In Section 7 we elaborate on the relation between ℓ1 minimization and Omp hinted at in Figure 1. This provides a natural segue to Section 8, which discusses the connections between Homotopy and Pfp. Section 9 then offers a comparison of the performance of Homotopy, Lars, Omp and Pfp in recovering sparse solutions. In Section 10, we demonstrate the applicability of Homotopy and Lars with examples inspired by NMR spectroscopy and MR imaging. Section 11 discusses the software accompanying this paper. Section 12 briefly discusses approximate methods for recovery of sparse solutions, and a final section has concluding remarks.
2 The Homotopy Algorithm
We shall start our exposition with a description of the Homotopy algorithm. Recall that the general principle undergirding homotopy methods is to trace a solution path, parametrized by one or more variables, while evolving the parameter vector from an initial value, for which the corresponding solution is known, to the desired value. For the Lasso problem (D_λ), this implies following the solution x_λ, starting at λ large and x_λ = 0, and terminating when λ → 0 and x_λ converges to the solution of (P_1). The solution path is followed by maintaining the optimality conditions of (D_λ) at each point along the path.

Specifically, let f_λ(x) denote the objective function of (D_λ). By classical ideas in convex analysis, a necessary condition for x_λ to be a minimizer of f_λ(x) is that 0 ∈ ∂_x f_λ(x_λ), i.e. the zero vector is an element of the subdifferential of f_λ at x_λ. We calculate

    ∂_x f_λ(x_λ) = −A^T (y − A x_λ) + λ ∂||x_λ||_1,    (2.3)

where ∂||x_λ||_1 is the subgradient

    ∂||x_λ||_1 = { u ∈ R^n : u_i = sgn(x_{λ,i}) if x_{λ,i} ≠ 0;  u_i ∈ [−1, 1] if x_{λ,i} = 0 }.

Let I = {i : x_λ(i) ≠ 0} denote the support of x_λ, and call c = A^T (y − A x_λ) the vector of residual correlations. Then, equation (2.3) can be written equivalently as the two conditions

    c(I) = λ · sgn(x_λ(I)),    (2.4)

and

    |c(I^c)| ≤ λ.    (2.5)

In words, residual correlations on the support I must all have magnitude equal to λ, and signs that match the corresponding elements of x_λ, whereas residual correlations off the support must have magnitude less than or equal to λ. The Homotopy algorithm now follows from these two conditions, by tracing out the optimal path x_λ that maintains (2.4) and (2.5) for all λ ≥ 0. The key to its successful operation is that the path x_λ is a piecewise linear path, with a discrete number of vertices [24, 40].
The algorithm starts with an initial solution x_0 = 0, and operates in an iterative fashion, computing solution estimates x_ℓ, ℓ = 1, 2, .... Throughout its operation, it maintains the active set I, which satisfies

    I = { j : |c_ℓ(j)| = ||c_ℓ||_∞ = λ },    (2.6)

as implied by conditions (2.4) and (2.5). At the ℓ-th stage, Homotopy first computes an update direction d_ℓ, by solving

    A_I^T A_I d_ℓ(I) = sgn(c_ℓ(I)),    (2.7)

with d_ℓ set to zero in coordinates not in I. This update direction ensures that the magnitudes of residual correlations on the active set all decline equally. The algorithm then computes the step size to the next breakpoint along the homotopy path. Two scenarios may lead to a breakpoint. First, that a non-active element of c_ℓ would increase in magnitude beyond λ, violating (2.5). This first occurs when

    γ_ℓ^+ = min_{i ∉ I} { (λ − c_ℓ(i)) / (1 − a_i^T v_ℓ),  (λ + c_ℓ(i)) / (1 + a_i^T v_ℓ) },    (2.8)

where v_ℓ = A_I d_ℓ(I), and the minimum is taken only over positive arguments. Call the minimizing index i^+. The second scenario leading to a breakpoint in the path occurs when an active coordinate crosses zero, violating the sign agreement in (2.4). This first occurs when

    γ_ℓ^− = min_{i ∈ I} { −x_ℓ(i) / d_ℓ(i) },    (2.9)

where again the minimum is taken only over positive arguments. Call the minimizing index i^−. Homotopy then marches to the next breakpoint, determined by

    γ_ℓ = min { γ_ℓ^+, γ_ℓ^− },    (2.10)

updates the active set, by either appending I with i^+ or removing i^−, and computes a solution estimate

    x_ℓ = x_{ℓ−1} + γ_ℓ d_ℓ.

The algorithm terminates when ||c_ℓ||_∞ = 0, and the solution of (P_1) has been reached.
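The following Python sketch condenses the iteration just described. It is illustrative only: it re-solves (2.7) from scratch rather than maintaining a Cholesky factorization (see Section 3.2), ignores the "one at a time" tie-breaking discussed in the remarks below, and uses our own tolerances and loop cap.

```python
import numpy as np

def homotopy(A, y, max_iter=None, tol=1e-10):
    """Illustrative Homotopy solver for (P_1): min ||x||_1 subject to y = Ax."""
    d, n = A.shape
    max_iter = max_iter or 10 * d
    x = np.zeros(n)
    c = A.T @ y                              # residual correlations c_l = A^T (y - A x)
    active = []
    for _ in range(max_iter):
        lam = np.max(np.abs(c))
        if lam < tol:                        # lambda = 0 limit reached: (P_1) solved
            break
        if not active:
            active = [int(np.argmax(np.abs(c)))]
        I = np.array(active)
        # update direction on the active set, eq. (2.7)
        dI = np.linalg.solve(A[:, I].T @ A[:, I], np.sign(c[I]))
        av = A.T @ (A[:, I] @ dI)            # a_i^T v_l for every column i
        # step to the next breakpoint: an inactive correlation reaches lambda, eq. (2.8)
        gamma_plus, i_plus = np.inf, None
        for i in set(range(n)) - set(active):
            for g in ((lam - c[i]) / (1 - av[i]), (lam + c[i]) / (1 + av[i])):
                if tol < g < gamma_plus:
                    gamma_plus, i_plus = g, i
        # ... or an active coordinate crosses zero, eq. (2.9)
        gamma_minus, i_minus = np.inf, None
        for j, i in enumerate(I):
            g = -x[i] / dI[j]
            if tol < g < gamma_minus:
                gamma_minus, i_minus = g, i
        gamma = min(gamma_plus, gamma_minus, lam)   # never step past lambda = 0
        x[I] += gamma * dI                          # move to the next vertex
        if gamma == gamma_minus:
            active.remove(i_minus)                  # a term leaves the active set
            x[i_minus] = 0.0
        elif gamma == gamma_plus:
            active.append(i_plus)                   # a term enters the active set
        c = A.T @ (y - A @ x)
    return x
```

Dropping the γ_ℓ^− computation, so that terms are never removed, turns this sketch into the Lars variant described in Remark 2 below.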
Remarks:
1. In the algorithm description above, we implicitly assume that at each breakpoint on the homotopy path, at most one new coordinate is considered as a candidate to enter the active set (Efron et al. called this the "one at a time" condition [24]). If two or more vectors are candidates, more care must be taken in order to choose the correct subset of coordinates to enter the active set; see [24] for a discussion.

2. In the introduction, we noted that the Lars scheme closely mimics Homotopy, the main difference being that Lars does not allow removal of elements from the active set. Indeed, to obtain the Lars procedure, one simply follows the sequence of steps described above, omitting the computation of i^− and γ_ℓ^−, and replacing (2.10) with γ_ℓ = γ_ℓ^+. The resulting scheme adds a single element to the active set at each iteration, never removing active elements from the set.
3. The Homotopy algorithm may be easily adapted to deal with noisy data. Assume that rather than observing y_0 = Aα_0, we observe a noisy version y = Aα_0 + z, with ||z||_2 ≤ ε_n. Donoho et al. [19, 16] have shown that, for certain matrix ensembles, the solution x_q of (L_q) with q = ε_n has an error which is at worst proportional to the noise level. To solve for x_{ε_n}, we simply apply the Homotopy algorithm as described above, terminating as soon as the residual satisfies ||r_ℓ||_2 ≤ ε_n. Since in most practical scenarios it is not sensible to assume that the measurements are perfectly noiseless, this attribute of Homotopy makes it an apt choice for use in practice.
3 Computational Cost
In earlier sections, we claimed that Homotopy can solve the problem (P_1) much faster than general-purpose LP solvers, which are traditionally used to solve it. In particular, when the k-step solution property holds, Homotopy runs in a fraction of the time it takes to solve one full linear system, making it as efficient as fast stepwise algorithms such as Omp. To support these assertions, we now present the results of running-time simulations, complemented by a formal analysis of the asymptotic complexity of Homotopy.
3.1 Running Times
To evaluate the performance of the Homotopy algorithm applied to (P_1), we compared its running times on instances of the problem suite S(USE, Gauss; d, n, k) with two state-of-the-art algorithms for solving Linear Programs. The first, LP Solve, is a Mixed Integer Linear Programming solver, implementing a variant of the Simplex algorithm [2]. The other, PDCO, a Primal-Dual Convex Optimization solver, is a log-barrier interior point method, written by Michael Saunders of the Stanford Optimization Laboratory [45]. Table 1 shows the running times for Homotopy, LP Solve and PDCO, for various problem dimensions. The figures appearing in the table were measured on a 3GHz Xeon workstation.
(d, n, k)           Homotopy    LP Solve    PDCO
(200, 500, 20)      0.03        2.04        0.90
(250, 500, 100)     0.94        9.27        1.47
(500, 1000, 100)    1.28        45.18       4.62
(750, 1000, 150)    2.15        85.78       7.68
(500, 2000, 50)     0.34        59.52       6.86
(1000, 2000, 200)   7.32        407.99      22.90
(1600, 4000, 320)   31.70       2661.42     122.47

Table 1: Comparison of execution times (in seconds) of Homotopy, LP Solve and PDCO, applied to instances of the random problem suite S(USE, Gauss; d, n, k).
Examination of the running times in the table suggests two important observations. First, when the underlying solution to the problem (P_1) is sparse, a tremendous saving in computation time is achieved using Homotopy, compared to traditional LP solvers. For instance, when A is 500 × 2000, and y admits sparse synthesis with k = 50 nonzeros, Homotopy terminates in about 0.34 seconds, 20 times faster than PDCO, and over 150 times faster than LP Solve. Second, when the linear system is highly underdetermined (i.e. d/n is small), even when the solution is not sparse, Homotopy is more efficient than either LP Solve or PDCO. This latter observation is of particular importance for applications, as it implies that Homotopy may be safely used to solve ℓ1 minimization problems even when the underlying solution is not known to be sparse.
3.2 Complexity Analysis
The timing studies above are complemented by a detailed analysis of the complexity of the Homotopy algorithm. We begin by noting that the bulk of computation is invested in the solution of the linear system (2.7) at each iteration. Thus, the key to an efficient implementation is maintaining a Cholesky factorization of the Gram minor A_I^T A_I, updating it with the addition/removal of elements to/from the active set. This allows for fast solution for the update direction; O(d^2) operations are needed to solve (2.7), rather than the O(d^3) flops it would ordinarily take. In detail, let ν = |I|, where I denotes the current active set. The dominant calculations per iteration are the solution of (2.7) at a cost of 2ν^2 flops, and computation of the step to the next vertex on the solution path, using νd + 6(n − ν) flops. In addition, the Cholesky factor of A_I^T A_I is either updated by appending a column to A_I at a cost of ν^2 + νd + 2(ν + d) flops, or downdated by removing a column of A_I using at most 3ν^2 flops.
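As an illustration of the column-append update (downdating on removal is analogous and omitted), here is a small Python sketch; the function is ours, and it assumes unit-norm columns, as elsewhere in the paper.

```python
import numpy as np
from scipy.linalg import solve_triangular

def cholesky_append(R, A_I, a_new):
    """Given upper-triangular R with R^T R = A_I^T A_I, return the factor of the
    enlarged Gram matrix after the unit-norm column a_new joins the active set."""
    if R.size == 0:
        return np.array([[1.0]])               # ||a_new||_2 = 1
    v = A_I.T @ a_new                          # cross-correlations with active columns
    w = solve_triangular(R, v, trans='T')      # solve R^T w = v, costs O(|I|^2)
    rho = np.sqrt(max(1.0 - w @ w, 0.0))       # new diagonal entry
    k = R.shape[0]
    R_new = np.zeros((k + 1, k + 1))
    R_new[:k, :k] = R
    R_new[:k, k] = w
    R_new[k, k] = rho
    return R_new
```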
To conclude, without any sparsity constraints on the data, k Homotopy steps would take at most 4kd^2/3 + kdn + O(kn) flops. However, if the conditions for the k-step solution property are satisfied, a more favorable estimate holds.

Lemma 1 Let (A, y) be a problem instance drawn from a suite S(E, V; d, n, k). Suppose the k-step solution property holds, and the Homotopy algorithm performs k steps, each time adding a single element into the active set. The algorithm terminates in k^3 + kdn + (1/2)k^2(d − 1) + n(8k + d + 1) + (1/2)k(5d − 3) flops.
Notice that, for k ≪ d, the dominant term in the expression for the operation count is kdn, the number of flops needed to carry out k matrix-vector multiplications. Thus, we may interpret Lemma 1 as saying that, under such favorable conditions, the computational workload of Homotopy is roughly proportional to k applications of a d × n matrix. For comparison, applying least-squares to solve the underdetermined linear system Ax = y would require 2d^2 n − 2d^3/3 operations [30]. Thus, for k ≪ d, Homotopy runs in a fraction of the time it takes to solve one least-squares system.
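As a quick numerical illustration of this comparison, using the operation counts just stated (the specific instance sizes below are our own choice):

```python
# Flop count from Lemma 1 versus one least-squares solve, for k = 50, d = 500, n = 2000.
k, d, n = 50, 500, 2000
homotopy_flops = k**3 + k*d*n + 0.5*k**2*(d - 1) + n*(8*k + d + 1) + 0.5*k*(5*d - 3)
lstsq_flops = 2*d**2*n - 2*d**3/3
print(homotopy_flops / lstsq_flops)   # ~0.06, i.e. a few percent of one least-squares solve
```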
To visualize this statement, panel (a) of Figure 3 displays the operation count of Homotopy on a grid with varying sparsity and indeterminacy factors. In this simulation, we measured the total operation count of Homotopy as a fraction of the solution of one d × n least-squares system, for random instances of the problem suite S(USE, Uniform; d, n, k). We fixed n = 1000, varied the indeterminacy of the system, indexed by δ = d/n, in the range [0.1, 1], and the sparsity of the solution, indexed by ρ = k/d, in the range [0.05, 1]. For reference, we superposed the theoretical bound ρ_W, below which the solution of (P_1) is, with high probability, the sparsest solution (see Section 9 for more details). Close inspection of this plot reveals that, below this curve, Homotopy delivers the solution to (P_1) rapidly, much faster than it takes to solve one d × n least-squares problem.
3.3 Number of Iterations
Considering the analysis just presented, it is clear that the computational efficiency of Homotopy greatly depends on the number of vertices on the polygonal Homotopy path. Indeed, since it allows removal of elements from the active set, Homotopy may, conceivably, require an arbitrarily large number of iterations to reach a solution. We note that this property is not shared by Lars or Omp; owing to the fact that these algorithms never remove elements from the active set, after d iterations they terminate with zero residual.

In [24], Efron et al. briefly noted that they had observed examples where Homotopy does not drop elements from the active set very frequently, and so, overall, it doesn't require many more iterations than Lars to reach a solution. We explore this initial observation further, and present evidence leading to Empirical Finding 3. Specifically, consider the problem suite S(E, V; d, n, k), with E ∈ {USE, PFE, PHE, URPE} and V ∈ {Uniform, Gauss, Bernoulli}. The space (d, n, k) of dimensions of underdetermined problems may be divided into three regions:

1. k-step region. When (d, n, k) are such that the k-step property holds, then Homotopy successfully recovers the sparse solution in k steps, as suggested by Empirical Findings 1 and 2. Otherwise;

2. k-sparse region. When (d, n, k) are such that ℓ1 minimization correctly recovers k-sparse solutions but Homotopy takes more than k steps, then, with high probability, it takes no more than c_s · d steps, with c_s a constant depending on E, V, empirically found to be ≤ 1.6. Otherwise;

3. Remainder. With high probability, Homotopy does not recover the sparsest solution, returning a solution to (P_1) in c_f · d steps, with c_f a constant depending on E, V, empirically found to be ≤ 4.85.

Figure 3: Computational Cost of Homotopy. Panel (a) shows the operation count as a fraction of one least-squares solution on a δ-ρ grid, with n = 1000. Panel (b) shows the number of iterations as a fraction of d = δn. The superimposed dashed curve depicts the curve ρ_W, below which Homotopy recovers the sparsest solution with high probability.
We note that the so-called constants c_s, c_f depend weakly on δ and n.
A graphical depiction of this division of the space of admissible (d, n, k) is given in panel (b) of Figure 3. It shows the number of iterations Homotopy performed for various (d, n, k) configurations, as a shaded attribute on a grid indexed by δ and ρ. Inspection of this plot reveals that Homotopy performs at most 1.6 · d iterations, regardless of the underlying solution sparsity. In particular, in the region below the curve ρ_W, where, with high probability, Homotopy recovers the sparsest solution, it does so in less than d steps.
More extensive evidence is given in Table 2, summarizing the results of a comprehensive study. We considered four matrix ensembles E, each coupled with three nonzero ensembles V. For a problem instance drawn from S(E, V; d, n, k), we recorded the number of iterations required to reach a solution. We repeated this at many different (d, n, k) configurations, generating 100 independent realizations for each (d, n, k), and computing the average number of iterations observed at each instance. Table 2 displays the estimated constants c_s, c_f for different combinations of matrix ensemble and coefficient ensemble. Thus, the results in Table 2 read, e.g., "Applied to a problem instance drawn from S(USE, Uniform; d, n, k), Homotopy takes, with high probability, no more than 1.69 · d iterations to obtain the minimum ℓ1 solution."
Coeff. Ensemble:      Uniform         Gauss           Bernoulli
Matrix Ensemble       c_s     c_f     c_s     c_f     c_s     c_f
USE                   1.65    1.69    1.6     1.7     1.23    1.68
PFE                   0.99    3.44    0.99    1.42    0.86    1.49
PHE                   0.99    4.85    0.92    1.44    0.87    1.46
URPE                  1.05    4.4     1       1.46    1       1.44

Table 2: Maximum number of Homotopy iterations, as a fraction of d, for various matrix / coefficient ensembles.
4 Fast Solution with the Incoherent Ensemble
Let the d × n matrix A have unit-length columns a_j, ||a_j||_2 = 1. The mutual coherence

    M(A) = max_{i ≠ j} |⟨a_i, a_j⟩|

measures the smallest angle between any pair of columns. As n > d, this angle must be greater than zero: the columns cannot be mutually orthogonal; in fact, there is an established lower bound on the coherence of A:

    M(A) ≥ sqrt( (n − d) / (d(n − 1)) ).

Matrices which satisfy this lower bound with equality are known as optimal Grassmannian frames [50]. There is by now much work establishing relations between the sparsity of a coefficient vector α_0 and the coherence of a matrix A needed for successful recovery of α_0 via ℓ1 minimization [20, 18, 53, 31] or Omp [19, 52]. In detail, for a general matrix A with coherence M(A), both ℓ1 minimization and OMP recover the solution α_0 from data y = Aα_0 whenever the number of nonzeros in α_0 satisfies

    ||α_0||_0 < (M(A)^{-1} + 1)/2;    (4.1)

see, for example, Theorem 7 of [18] or Corollary 3.6 of [52]. Comparing (4.1) with (1.1), we see that Theorem 1 essentially states that a parallel result holds for the Homotopy algorithm.
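As a small numerical illustration (our own example, not from the paper), the snippet below computes M(A) for a matrix with unit-norm Gaussian columns and the corresponding sparsity bound (M(A)^{-1} + 1)/2 appearing in (1.1) and (4.1); for random matrices this coherence-based bound is quite small, consistent with the much stronger empirical behavior reported in Section 5.

```python
import numpy as np

def mutual_coherence(A):
    """M(A) = max_{i != j} |<a_i, a_j>| for a matrix with unit-norm columns."""
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(0)
d, n = 100, 400
A = rng.standard_normal((d, n))
A /= np.linalg.norm(A, axis=0)               # normalize columns to unit length
M = mutual_coherence(A)
print(M, (1.0 / M + 1.0) / 2.0)              # coherence and the bound on ||alpha_0||_0
```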
Before proceeding to the proof of the Theorem, we make a few introductory assumptions. As in the above, we assume that the Homotopy algorithm operates with a problem instance (A, y) as input, with y = Aα_0 and ||α_0||_0 = k. To simplify notation, we assume, without loss of generality, that α_0 has its nonzeros in the first k positions. Further, we operate under the convention that at each step of the algorithm, only one vector is introduced into the active set. If two or more vectors are candidates to enter the active set, assume the algorithm inserts them one at a time, on separate stages. Finally, to fix notation, let x_ℓ denote the Homotopy solution at the ℓ-th step, r_ℓ = y − A x_ℓ denote the residual at that step, and c_ℓ = A^T r_ℓ be the corresponding residual correlation vector. To prove Theorem 1, we introduce two useful notions.
Definition 3 We say that the Homotopy algorithm has the Correct Term Selection Property at a given problem instance (A, y), with y = Aα_0, if at each iteration, the algorithm selects a new term to enter the active set from the support set of α_0.

Homotopy has the correct term selection property if, throughout its operation, it builds the solution using only correct terms. Thus, at termination, the support set of the solution is guaranteed to be a subset of the support set of α_0.
Definition 4 We say that the Homotopy algorithm has the Sign Agreement Property at a given problem instance (A, y), with y = Aα_0, if at every step ℓ, for all j ∈ I,

    sgn(x_ℓ(j)) = sgn(c_ℓ(j)).

In words, Homotopy has the sign agreement property if, at every step of the algorithm, the residual correlations in the active set agree in sign with the corresponding solution coefficients. This ensures that the algorithm never removes elements from the active set.

These two properties are the fundamental building blocks needed to ensure that the k-step solution property holds. In particular, these two properties are necessary and sufficient conditions for successful k-step termination, as the following lemma shows.
Lemma 2 Let (A, y) be a problem instance drawn from a suite S(E, V; d, n, k). The Homotopy algorithm, when applied to (A, y), has the k-step solution property if and only if it has the correct term selection property and the sign agreement property.
Proof. To argue in the forward direction, we note that, after k steps, correct term selection implies that the active set is a subset of the support of α_0, i.e. I ⊆ {1, ..., k}. In addition, the sign agreement property ensures that no variables leave the active set. Therefore, after k steps, I = {1, ..., k} and the Homotopy algorithm recovers the correct sparsity pattern. To show that after k steps the algorithm terminates, we note that, at the k-th step, the step-size γ_k is chosen so that, for some j ∈ I^c,

    |c_k(j) − γ_k a_j^T A_I d_k(I)| = λ_k − γ_k,    (4.2)

with λ_k = ||c_k(I)||_∞. In addition, for the k-th update, we have

    A_I d_k(I) = A_I (A_I^T A_I)^{-1} A_I^T r_k = r_k,

since r_k is contained in the column space of A_I. Hence

    c_k(I^c) = A_{I^c}^T A_I d_k(I),

and γ_k = λ_k is chosen to satisfy (4.2). Therefore, the solution at step k has A x_k = y. Since y has a unique representation in terms of the columns of A_I, we may conclude that x_k = α_0.

The converse is straightforward. In order to terminate with the solution α_0 after k steps, the Homotopy algorithm must select one term out of {1, ..., k} at each step, never removing any elements. Thus, violation of either the correct term selection property or the sign agreement property would result in a number of steps greater than k or an incorrect solution.

Below, we will show that, when the solution α_0 is sufficiently sparse, both these properties hold as the algorithm traces the solution path. Theorem 1 then follows naturally.
4.1 Correct Term Selection
We now show that, when sufficient sparsity is present, the Homotopy solution path maintains the correct term selection property.

Lemma 3 Suppose that y = Aα_0, where α_0 has only k nonzeros, with k satisfying

    k ≤ (μ^{-1} + 1)/2.    (4.3)

Assume that the residual at the ℓ-th step can be written as a linear combination of the first k columns of A, i.e.

    r_ℓ = Σ_{j=1}^{k} α_ℓ(j) a_j.

Then the next step of the Homotopy algorithm selects an index from among the first k.
Proof. We will show that at the ℓ-th step,

    max_{1 ≤ i ≤ k} |⟨r_ℓ, a_i⟩| > max_{i > k} |⟨r_ℓ, a_i⟩|,    (4.4)

and so at the end of the ℓ-th iteration, the active set is a subset of {1, ..., k}.

Let G = A^T A denote the Gram matrix of A. Let î = arg max_{1 ≤ i ≤ k} |α_ℓ(i)|. The left hand side of (4.4) is bounded below by

    max_{1 ≤ i ≤ k} |⟨r_ℓ, a_i⟩| ≥ |⟨r_ℓ, a_î⟩|
                               = | Σ_{j=1}^{k} α_ℓ(j) G_{î,j} |
                               ≥ |α_ℓ(î)| − Σ_{j ≠ î} |G_{î,j}| |α_ℓ(j)|
                               ≥ |α_ℓ(î)| − M(A) Σ_{j ≠ î} |α_ℓ(j)|
                               ≥ |α_ℓ(î)| − M(A) (k − 1) |α_ℓ(î)|.    (4.5)

Here we used: ||a_j||_2^2 = 1 for all j, and |G_{î,j}| ≤ M(A) for j ≠ î. As for the right hand side of (4.4), we note that for i > k we have

    |⟨r_ℓ, a_i⟩| ≤ Σ_{j=1}^{k} |α_ℓ(j)| |G_{i,j}|
                ≤ M(A) Σ_{j=1}^{k} |α_ℓ(j)|
                ≤ k M(A) |α_ℓ(î)|.    (4.6)

Combining (4.5) and (4.6), we get that for (4.4) to hold, we need

    1 − (k − 1) M(A) > k M(A).    (4.7)

Since k was selected to exactly satisfy this bound, relation (4.4) follows.

Thus, when k satisfies (4.3), the Homotopy algorithm only ever considers indices among the first k as candidates to enter the active set.
4.2 Sign Agreement
Recall that an index is removed from the active set at step ℓ only if condition (2.4) is violated, i.e. if the signs of the residual correlations c_ℓ(i), i ∈ I, do not match the signs of the corresponding elements of the solution x_ℓ. The following lemma shows that, when α_0 is sufficiently sparse, an even stronger result holds: at each stage of the algorithm, the residual correlations in the active set agree in sign with the direction of change of the corresponding terms in the solution. In other words, the solution moves in the right direction at each step. In particular, it implies that throughout the Homotopy solution path, the sign agreement property is maintained.
Lemma 4 Suppose that y = Aα_0, where α_0 has only k nonzeros, with k satisfying

    k ≤ (μ^{-1} + 1)/2.

For ℓ ∈ {1, ..., k}, let c_ℓ = A^T r_ℓ, and let the active set I be defined as in (2.6). Then the update direction d_ℓ defined by (2.7) satisfies

    sgn(d_ℓ(I)) = sgn(c_ℓ(I)).    (4.8)
Proof. Let λ = ||c_ℓ||_∞. We will show that for i ∈ I, |d_ℓ(i) − sgn(c_ℓ(i))| < 1, implying that (4.8) holds. To do so, we note that (2.7) can be rewritten as

    (A_I^T A_I − I_d) λ d_ℓ(I) = −λ d_ℓ(I) + c_ℓ(I),

where I_d is the identity matrix of appropriate dimension, and we used c_ℓ(I) = λ · sgn(c_ℓ(I)) on the active set. Since the off-diagonal entries of A_I^T A_I are bounded in magnitude by M(A), and |I| ≤ k ≤ (M(A)^{-1} + 1)/2, the induced ℓ_∞ operator norm satisfies ||A_I^T A_I − I_d||_{(∞,∞)} ≤ (k − 1) M(A) ≤ (1 − M(A))/2. This yields

    ||λ d_ℓ(I) − c_ℓ(I)||_∞ ≤ ||A_I^T A_I − I_d||_{(∞,∞)} ||λ d_ℓ(I)||_∞
                            ≤ ((1 − M(A))/2) ||λ d_ℓ(I)||_∞
                            ≤ ((1 − M(A))/2) ( ||c_ℓ(I)||_∞ + ||λ d_ℓ(I) − c_ℓ(I)||_∞ )
                            ≤ ((1 − M(A))/2) ( λ + ||λ d_ℓ(I) − c_ℓ(I)||_∞ ).

Rearranging terms, we get

    ||λ d_ℓ(I) − c_ℓ(I)||_∞ ≤ λ (1 − M)/(1 + M) < λ,

thus,

    ||d_ℓ(I) − sgn(c_ℓ(I))||_∞ < 1.

Relation (4.8) follows.
4.3 Proof of Theorem 1
Lemmas 3 and 4 establish the correct term selection property and the sign agreement property at a single iteration of the algorithm. We will now give an inductive argument showing that the two properties hold at every step ℓ ∈ {1, ..., k}.

In detail, the algorithm starts with x_0 = 0; by the sparsity assumption on y = Aα_0, we may apply Lemma 3 with r_0 = y, to get that at the first step, a column among the first k is selected. Moreover, at the end of the first step we have x_1 = γ_1 d_1, and so, by Lemma 4,

    sgn(x_1(I)) = sgn(c_1(I)).

By induction, assume that at step ℓ, only indices among the first k are in the active set, i.e.

    r_ℓ = Σ_{j=1}^{k} α_ℓ(j) a_j,

and the sign condition (2.4) holds, i.e.

    sgn(x_ℓ(I)) = sgn(c_ℓ(I)).

Applying Lemma 3 to r_ℓ, we get that at the (ℓ+1)-th step, the term to enter the active set will be selected from among the first k indices. Moreover, for the updated solution we have

    sgn(x_{ℓ+1}(I)) = sgn(x_ℓ(I) + γ_{ℓ+1} d_{ℓ+1}(I)).

By Lemma 4 we have

    sgn(d_{ℓ+1}(I)) = sgn(c_{ℓ+1}(I)).

We observe that

    c_{ℓ+1} = c_ℓ − γ_{ℓ+1} A^T A d_{ℓ+1},

whence, on the active set,

    |c_{ℓ+1}(I)| = |c_ℓ(I)| (1 − γ_{ℓ+1}),

and so

    sgn(c_{ℓ+1}(I)) = sgn(c_ℓ(I)).

In words, once a vector enters the active set, its residual correlation maintains the same sign. We conclude that

    sgn(x_{ℓ+1}(I)) = sgn(c_{ℓ+1}(I)).    (4.9)

Hence no variables leave the active set. To summarize, both the correct term selection property and the sign agreement property are maintained throughout the execution of the Homotopy algorithm.

We may now invoke Lemma 2 to conclude that the k-step solution property holds. This completes the proof of Theorem 1.
5 Fast Solution with the Uniform Spherical Ensemble
Suppose now that A is a random matrix drawn from the Uniform Spherical Ensemble. It is not hard to show that such matrices A are naturally incoherent; in fact, for ε > 0, with high probability for large n,

    M(A) ≤ sqrt( 4 log(n) / d ) · (1 + ε).

Thus, applying Theorem 1, we get as an immediate corollary that if k satisfies

    k ≤ sqrt( d / log(n) ) · (1/4 − ε),    (5.10)

then Homotopy is highly likely to recover any sparse vector α_0 with at most k nonzeros. However, as it turns out, Homotopy operates much more efficiently when applied to instances from the random suite S(USE; d, n, k) than what is implied by incoherence. We now discuss evidence leading to Empirical Finding 1. We demonstrate, through a comprehensive suite of simulations, that the formula d/(2 log(n)) accurately describes the breakdown of the k-step solution property. Before doing so, we introduce two important tools that will be used repeatedly in the empirical analysis below.
5.1 Phase Diagrams and Phase Transitions
In statistical physics, a phase transition is the transformation of a thermodynamic system from one phase to another. The distinguishing characteristic of a phase transition is an abrupt change in one or more of its physical properties, as underlying parameters cross a region of parameter space. Borrowing terminology, we say that a property T exhibits a phase transition at a sequence of problem suites S(E, V; d, n, k) if there is a threshold function Thresh(d, n) on the parameter space (k, d, n), such that problem instances with k ≤ Thresh(d, n) have property T with high probability, and problem instances above this threshold do not have property T with high probability. As we show below, the k-step solution property of Homotopy, applied to problem instances drawn from S(USE; d, n, k), exhibits a phase transition at around d/(2 log(n)). Thus, there is a sequence (ε_n) with each term small and positive, so that for k < (1 − ε_n) d/(2 log(n)), Homotopy delivers the sparsest solution in k steps with high probability; on the other hand, if k > (1 + ε_n) d/(2 log(n)), then with high probability, Homotopy fails to terminate in k steps.
To visualize phase transitions, we utilize phase diagrams. In statistical physics, a phase diagram is a graphical depiction of a parameter space decorated by boundaries of regions where certain properties hold. Here we use this term in an analogous fashion, to mean a graphical depiction of a subset of the parameter space (d, n, k), illustrating the region where an algorithm has a given property T when applied to a problem suite S(E,V; d, n, k). It will typically take the form of a 2-D grid, with the threshold function associated with the corresponding phase transition dividing the grid into two regions: a region where property T occurs with high probability, and a region where T occurs with low probability.
While phase transitions are sometimes discovered by formal mathematical analysis, more
commonly, the existence of phase transitions is unearthed by computer simulations. We now
describe an objective framework for estimating phase transitions from simulation data. For
each problem instance (A, y) drawn from a suite S(E,V; d, n, k), we associate a binary outcome variable Y_k^{d,n}, with Y_k^{d,n} = 1 when property T is satisfied on that realization, and Y_k^{d,n} = 0 otherwise. Within our framework, Y_k^{d,n} is modeled as a Bernoulli random variable with probability π_k^{d,n}. Thus, P(Y_k^{d,n} = 1) = π_k^{d,n}, and P(Y_k^{d,n} = 0) = 1 − π_k^{d,n}. Our goal is to estimate a value k̂_{d,n} where the transition between these two states occurs. To do so, we employ a logistic regression model on the mean response E[Y_k^{d,n}] with predictor variable k,

    E[Y_k^{d,n}] = exp(β_0 + β_1 k) / (1 + exp(β_0 + β_1 k)).             (5.11)
Logistic response functions are often used to model threshold phenomena in statistical data analysis [34]. Let π̂_k^{d,n} denote the value of the fitted response function. We may then compute a value k̂_δ^{d,n} indicating that, with probability exceeding 1 − δ, for a problem instance drawn from S(E,V; d, n, k) with k < k̂_δ^{d,n}, property T holds. k̂_δ^{d,n} is defined by the relation

    π̂_{k̂}^{d,n} = 1 − δ.

Thus, computing k̂_δ^{d,n} for a range of (d, n) values essentially maps out an empirical phase transition of property T. The value δ fixes a location on the transition band below which we define the region of success. In our work, we set δ = 0.25.
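The estimation procedure just described can be summarized in a few lines of code. The following sketch is illustrative only: the success indicators Y are generated synthetically here (a placeholder for actually running Homotopy on instances of S(E,V; d, n, k)), the logistic fit (5.11) is a hand-coded Newton-Raphson maximum likelihood routine, and k̂ is read off by solving β_0 + β_1 k = logit(1 − δ) with δ = 0.25.

    import numpy as np

    def fit_logistic(k, y, n_iter=50):
        """Maximum likelihood fit of P(Y=1) = exp(b0 + b1 k)/(1 + exp(b0 + b1 k)) by Newton-Raphson."""
        X = np.column_stack([np.ones_like(k, dtype=float), k.astype(float)])
        beta = np.zeros(2)
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-X @ beta))
            W = p * (1 - p)
            grad = X.T @ (y - p)                       # gradient of the log-likelihood
            hess = X.T @ (X * W[:, None])              # (negative) Hessian
            beta += np.linalg.solve(hess, grad)
        return beta

    # Hypothetical simulation outcomes: success indicators Y over a range of sparsities k.
    rng = np.random.default_rng(1)
    k_grid = np.repeat(np.arange(1, 41), 100)          # 100 trials per k
    true_threshold = 14.5                              # placeholder for d/(2 log n)
    y = (rng.random(k_grid.size) < 1 / (1 + np.exp(k_grid - true_threshold))).astype(float)

    b0, b1 = fit_logistic(k_grid, y)
    delta = 0.25
    k_hat = (np.log((1 - delta) / delta) - b0) / b1    # solves pi_hat(k) = 1 - delta
    print(f"estimated transition k_hat(delta=0.25) = {k_hat:.1f}")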
5.2 The k-Step Solution Property
We now apply the method described in the previous section to estimate an empirical phase
transition for the k-step solution property of the Homotopy algorithm. Before doing so, we
Figure 4: k-Step Solution Property of Homotopy for the Problem Suite S(USE,Uniform; d, n, k). Shaded attribute is the proportion of successful terminations, out of 100 trials, in the case: (a) k = 1 . . . 40, d = 200, and n = 200 . . . 20000; (b) k = 2 . . . 100, d = 100 . . . 1000, and n = 1000. Overlaid are curves for d/(2 log(n)) (solid), k̂_{0.25}^{d,n} (dashed), and the confidence bound for k̂_{0.25}^{d,n} with 95% confidence level (dotted).
briefly describe the experimental setup. For (d, n, k) given, we generated a problem instance (A, y) drawn from the problem suite S(USE,Uniform; d, n, k). To this instance we applied the Homotopy algorithm, and recorded the outcome of the response variable Y_k^{d,n}. We did so in two cases; first, fixing d = 200, varying n through 40 log-equispaced points in [200, 20000] and k through 40 equispaced points in [1, 40]; second, keeping n fixed at 1000, varying d through 40 equispaced points in [100, 1000] and k through 40 equispaced points in [2, 100]. For each (d, n, k) combination we ran 100 independent trials. To estimate the regression coefficients β_0, β_1 in (5.11), we used maximum likelihood estimation [34].
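For readers who wish to reproduce a single cell of this experiment without SparseLab, the following hedged sketch generates one instance from S(USE, Uniform; d, n, k) (columns uniform on the sphere; nonzeros taken as U[0,1], following the convention in Figure 12) and uses scikit-learn's lars_path in its "lasso" mode as a stand-in for the Homotopy solver, checking whether the path terminates in exactly k steps at α_0.

    import numpy as np
    from sklearn.linear_model import lars_path

    rng = np.random.default_rng(2)
    d, n, k = 200, 1000, 8

    # One instance from S(USE, Uniform; d, n, k).
    A = rng.standard_normal((d, n))
    A /= np.linalg.norm(A, axis=0)
    alpha0 = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    alpha0[support] = rng.uniform(0, 1, size=k)
    y = A @ alpha0

    # The "lasso" variant of lars_path follows the Homotopy path (terms may also leave the
    # active set); the number of path segments tells us whether k-step termination occurred.
    alphas, active, coefs = lars_path(A, y, method="lasso", alpha_min=0.0)
    steps = coefs.shape[1] - 1
    recovered = np.allclose(coefs[:, -1], alpha0, atol=1e-6)
    print(f"steps = {steps}, k = {k}, k-step property: {steps == k and recovered}")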
Figure 4 displays two phase diagrams, giving a succinct depiction of the simulation results. Panel (a) displays a grid of (k, n) values, k ∈ [1, 40], n ∈ [200, 20000]. Each point on this grid takes a value between 0 and 1, representing the ratio of successful outcomes to overall trials; these are precisely the observed E[Y_k^{d,n}], on a grid of (k, n) values, with d fixed. Overlaid are curves for the estimated k̂_δ^{d,n} for each n with δ = 0.25, and the theoretical bound d/(2 log(n)). Panel (b) has a similar configuration on a (k, d) grid, with n fixed at 1000. Careful inspection of these phase diagrams reveals that the estimated k̂_δ^{d,n} closely follows the curve d/(2 log(n)); panel (a) shows the logarithmic behavior of the phase transition as n varies, and panel (b) shows its linear behavior as d varies.
Table 3 offers a summary of the results of this experiment, for the three nonzero ensembles used in our simulations: Uniform, Gauss, and Bernoulli. In each case, two measures of statistical goodness-of-fit are reported: the mean-squared error (MSE) and the R^2 statistic. The MSE measures the squared deviation of the estimated k̂ values from the proposed formula d/(2 log(n)). The R^2 statistic measures how well the observed phase transition fits the theoretical model; a value close to 1 indicates a good fit. These statistical measures give another indication that the empirical behavior is in line with theoretical predictions. Moreover, they support the theoretical assertion that the k-step solution property of the Homotopy algorithm is independent of the coefficient distribution.
The results in Figure 4 and Table 3 all demonstrate consistency between the empirical phase
Coefficient Ensemble    MSE      R^2
Uniform                 0.09     0.98
Gauss                   0.095    0.98
Bernoulli               0.11     0.98

Table 3: Deviation of estimated k̂_δ^{d,n} from d/(2 log(n)) for S(USE,V; d, n, k), for different coefficient ensembles V.
Figure 5: Sharpness of Phase Transition. The top panel shows the behavior of the regression coefficient β_1 of (5.11) for d = 200 and n = 200 . . . 20000. The bottom panels show two instances of the fitted transition model, for (b) d = 200, n = 200 and (c) d = 200, n = 20000.
transition and the bound d/(2 log(n)), even for very modest problem sizes. We now turn to examine the sharpness of the phase transition as the problem dimensions increase. In other words, we ask whether for large d, n, the width of the transition band becomes narrower, i.e. ε_n in (1.2) becomes smaller as n increases. We note that the regression coefficient β_1 in (5.11) associated with the predictor variable k dictates how sharp the transition from π_k^{d,n} = 1 to π_k^{d,n} = 0 is; a large negative value for β_1 implies a step-function behavior. Hence, we expect β_1 to grow in magnitude as the problem size increases. This is indeed verified in panel (a) of Figure 5, which displays the behavior of β_1 for increasing n, with d fixed at 200. For illustration, panels (b) and (c) show the observed mean response Y_k^{d,n} vs. k for n = 200, with β_1 ≈ −0.35, and n = 20000, with β_1 ≈ −1.1, respectively. Clearly, the phase transition in panel (c) is much sharper.
5.3 Correct Term Selection and Sign Agreement
In Section 4, we identified two fundamental properties of the Homotopy solution path, namely correct term selection and sign agreement, that constitute necessary and sufficient conditions for the k-step solution property to hold. We now present the results of a simulation study identifying phase transitions in these cases.
In detail, we repeated the experiment described in the previous section, generating problem instances from the suite S(USE,Uniform; d, n, k) and applying the Homotopy algorithm. As the outcome of the response variable Y_k^{d,n}, we recorded first the property
Figure 6: Phase diagrams for (a) the correct term selection property; (b) the sign agreement property, on a k-n grid, with k = 1 . . . 40, d = 200, and n = 200 . . . 20000. Overlaid are curves for k̂_{0.25}^{d,n} (dashed), and the confidence bound for k̂_{0.25}^{d,n} with 95% confidence level (dotted). Panel (a) also displays the curve d/(2 log(n)) (solid), and panel (b) displays the curve d/(√2 log(n)) (solid). The transition on the right-hand display occurs significantly above the transition in the left-hand display. Hence the correct term selection property is the critical one for overall success.
"The Homotopy algorithm selected only terms from among 1, ..., k as entries to the active set", and next, the property "At each iteration of the Homotopy algorithm, the signs of the residual correlations on the active set match the signs of the corresponding solution terms". Finally, we estimated phase transitions for each of these properties. Results are depicted in panels (a) and (b) of Figure 6. Panel (a) displays a phase diagram for the correct term selection property, and panel (b) has a similar phase diagram for the sign agreement property.
The results are quite instructive. The phase transition for the correct term selection property agrees well with the formula d/(2 log(n)). Rather surprisingly, the phase transition for the sign agreement property seems to be at a significantly higher threshold, d/(√2 log(n)). Consequently, the weak link for the Homotopy k-step solution property, i.e. the attribute that dictates the sparsity bound for the k-step solution property to hold, is the addition of incorrect terms into the active set. We interpret this finding as saying that the Homotopy algorithm would typically err first on the safe side, adding terms off the support of α_0 into the active set (a.k.a. false discoveries), before removing correct terms from the active set (a.k.a. missed detections).
5.4 Random Signs Ensemble
Matrices in the Random Signs Ensemble (RSE) are constrained to have elements of constant amplitude, with signs independently drawn from an equi-probable Bernoulli distribution. This ensemble is known to be effective in various geometric problems associated with underdetermined systems, through work of Kashin [36], followed by Garnaev and Gluskin [28]. Previous work [55, 56, 54] gave theoretical and empirical evidence to the fact that many results developed in the USE case hold when the matrices considered are drawn from the RSE. Indeed, as we now demonstrate, the RSE shares the k-step solution property with the USE.
To show this, we conducted a series of simulations, paralleling the study described in Section 5.2. In our study, we replaced the problem suite S(USE; d, n, k) by S(RSE; d, n, k). Just as above, we estimated k̂_δ^{d,n} in two cases: first, fixing d = 200, varying n through 40 log-equispaced
Figure 7: k-Step Termination of Homotopy for the problem suite S(RSE,Uniform; d, n, k). Shaded attribute is the proportion of successful terminations, out of 100 trials, in the case: (a) k = 1 . . . 40, d = 200, and n = 200 . . . 20000; (b) k = 2 . . . 100, d = 100 . . . 1000, and n = 1000. Overlaid are curves for d/(2 log(n)) (solid), k̂_{0.25}^{d,n} (dashed), and the confidence bound for k̂_{0.25}^{d,n} with 95% confidence level (dotted).
points in [200, 20000] and k through 40 equispaced points in [1, 40]; second, keeping n fixed at 1000, varying d through 40 equispaced points in [100, 1000] and k through 40 equispaced points in [2, 100]. The resulting phase diagrams appear in panels (a),(b) of Figure 7. Careful inspection of the results indicates that the empirical phase transition for the RSE agrees well with the formula d/(2 log(n)) that describes the behavior in the USE case. This is further verified by examining Table 4, which has MSE and R^2 measures for the deviation of the empirical phase transition from the threshold d/(2 log(n)), for different coefficient ensembles. In summary, these empirical results give strong evidence to support the conclusion that the Homotopy algorithm has the k-step solution property when applied to the problem suite S(RSE; d, n, k) as well. An important implication of this conclusion is that in practice, we may use matrices drawn from the RSE to replace matrices from the USE without loss of performance. This may result in reduced complexity and increased efficiency, as matrices composed of random signs can be generated and applied much more rapidly than general matrices composed of real numbers.
Coefficient Ensemble    MSE      R^2
Uniform                 0.12     0.98
Gauss                   0.11     0.98
Bernoulli               0.08     0.99

Table 4: Deviation of estimated k̂_δ^{d,n} from d/(2 log(n)) for S(RSE,V; d, n, k), for different coefficient ensembles V.
6 Fast Solution with Partial Orthogonal Ensembles
We now turn our attention to a class of matrices composed of partial orthogonal matrices, i.e.
matrices constructed by sampling a subset of the rows of an orthogonal matrix. In particular,
we will be interested in the following three ensembles:
Partial Fourier Ensemble (PFE). Matrices in this ensemble are obtained by taking an n
by n Fourier matrix and sampling d rows at random. The PFE is of utmost importance
in medical imaging and spectroscopy, as it can be used to model reconstruction of image
data from partial information in the frequency domain [7, 6, 37, 38].
Partial Hadamard Ensemble (PHE). Matrices in this ensemble are obtained by taking an
n by n Hadamard matrix and sampling d of its rows at random. This ensemble is known
to be important for various extremal problems in experimental design [33].
Uniform Random Projections (URPE). Matrices in this ensemble are obtained by taking
an n by n random orthogonal matrix and sampling d of its rows at random. It serves as
a general model for the PFE and PHE.
As we now demonstrate, the k-step solution property of Homotopy holds over a significantly greater range of signal sparsities than observed in Section 5.
We first consider the PFE and PHE. Partial Fourier matrices and partial Hadamard matrices have drawn considerable attention in the sparse approximation community [7, 6, 56, 55, 29]. We mention two incentives:
Modeling of Physical Phenomena. Fourier operators play a key role in numerous
engineering applications modeling physical phenomena. For instance, acquisition of 2- and
3-D Magnetic Resonance images is modeled as applying a Fourier operator to spatial data.
Thus, application of a partial Fourier operator would correspond to acquisition of partial
frequency data. Likewise, the Hadamard operator, albeit not as common, is often used in
Hadamard Transform Spectroscopy.
Fast Implementations. Both the Fourier transform and the Hadamard transform have special structure allowing for rapid computation. They can be computed by algorithms which run much faster than the naive implementation from the definition. In detail, the Fast Fourier Transform (FFT) can be applied to an n-vector using O(n log(n)) operations, rather than the n^2 operations needed by direct computation [3]. Similarly, a Fast Hadamard Transform (FHT) exists, which applies a Hadamard operator using O(n log(n)) operations. With these fast algorithms, rapid computation of partial Fourier or partial Hadamard products is straightforward. Specifically, note that we may write a d by n matrix A from the PFE as A = R · F, with F an n by n Fourier matrix, and R a d by n random sampling matrix. Translating to linear operations, application of A amounts to applying F, using the FFT, followed by sampling d entries of the resulting vector; similarly for partial Hadamard matrices (see the sketch following this list).
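The following sketch (ours, not SparseLab code) makes the operator viewpoint concrete for the PFE: it applies A = R · F and its adjoint through the FFT in O(n log n) time, and verifies the adjoint relation numerically. The sampling pattern and dimensions are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 1024, 256
    rows = np.sort(rng.choice(n, size=d, replace=False))   # R: keep d random rows of F

    def A_apply(x):
        """y = R F x: full FFT followed by sampling d entries (O(n log n) instead of O(n^2))."""
        return np.fft.fft(x)[rows] / np.sqrt(n)            # 1/sqrt(n) makes F unitary

    def A_adjoint(y):
        """x = F^* R^T y: zero-fill the d samples, then inverse FFT."""
        full = np.zeros(n, dtype=complex)
        full[rows] = y
        return np.fft.ifft(full) * np.sqrt(n)

    # Consistency check of the adjoint pair: <y, A x> == <A* y, x>.
    x = rng.standard_normal(n)
    y = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    print(np.allclose(np.vdot(y, A_apply(x)), np.vdot(A_adjoint(y), x)))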
The experimental setup leading to Empirical Finding 2 utilizes a more extensive phase diagram. Expressly, we fix n and generate a grid of points indexed by variables δ = d/n and ρ = k/d, with δ in (0, 1], ρ in (0, 1]. Thus, at a fixed δ, d = ⌊δn⌋, and k = ⌊ρd⌋ ranges between 1 and d. In a sense, this δ-ρ grid gives a complete picture of all sparse underdetermined settings of interest. We considered two discretizations. First, we fixed n = 1024, varied δ in the range [0.1, 1], and ρ in [0.05, 1]. Second, we fixed δ = 0.5, varied n in [256, 1024], and ρ in [0.05, 1]. Data collection and estimation of an empirical phase transition were done in the same manner described in Section 5.2. Figures 8 and 9 summarize the simulation results, each depicting three phase diagrams, for three different nonzero distributions.
The results are indeed surprising; for problems from the suite S(PFE; d, n, k), Homotopy maintains the k-step solution property at a much broader range of sparsities k than observed
Figure 8: k-Step Solution Property of Homotopy for the Suite S(PFE,V; d, n, k). Each phase diagram presents success rates on a δ-ρ grid, with ρ = k/d in [0.05, 1], δ = d/n in [0.1, 1], and n = 1024, for: (a) V = Uniform; (b) V = Gauss; (c) V = Bernoulli. Overlaid are curves for ρ̂_{0.25}(δ) (dash-dotted), with the confidence bounds for 95% confidence level (dotted), and ρ = 1/(2 log(n)) (solid). The range of sparsities in which the k-step solution property holds is much more extensive than implied by the bound k < d/(2 log(n)).

Figure 9: k-Step Solution Property of Homotopy for the Suite S(PFE,V; d, n, k). Each phase diagram presents success rates on a ρ-n grid, with ρ = k/d in [0.05, 1], n in [128, 8192], and d = 125, for: (a) V = Uniform; (b) V = Gauss; (c) V = Bernoulli. Overlaid are curves for ρ̂_{0.25} (solid), with the confidence bounds for 95% confidence level (dotted).
for problems from S(USE; d, n, k), particularly at high values of the indeterminacy ratio δ. For instance, at δ = 3/4, the empirical results indicate that when the nonzeros are uniformly distributed, the empirical phase transition is at k̂ = ρ̂ · d ≈ 174, whereas the formula d/(2 log(n)) gives ≈ 55, implying that the range of sparsities for which Homotopy terminates in k steps has grown threefold.
We conducted parallel simulation studies for the PHE and the URPE, and we summarize the empirical findings in Figure 10. It displays the empirical phase transitions ρ̂(δ) for the three partial orthogonal ensembles with different nonzero distributions. Careful inspection of the curves reveals that the k-step property of Homotopy holds at roughly the same region of (δ, ρ) space for all combinations of matrix ensembles and coefficient distributions considered, with mild variations in the location and shape of the phase transition for different nonzero distributions. In particular, the data indicate that problems with underlying nonzero distribution Bernoulli exhibit the lowest k-step breakdown. Thus, in some sense, the hardest problems are those with uniform-amplitude nonzeros in the solution.
The empirical phase transition ρ̂(δ) does not lend itself to a simple representation in functional form. It does, however, have several important characteristics, which we now highlight.
Figure 10: Estimated Phase Transition Curves. Each panel displays empirical phase transition curves for the PFE (solid), PHE (dashed), and URPE (dash-dotted) on a δ-ρ grid (δ = d/n, ρ = k/d), with (a) Uniform nonzero distribution; (b) Gaussian nonzero distribution; (c) Bernoulli nonzero distribution.
First, notice that at the right end of the phase diagram, we have ρ̂(δ = 1) = 1. In other words, when the matrix A is square, Homotopy always recovers the sparsest solution in k steps. This is hardly surprising, since, at δ = 1, the matrix A is orthogonal, and the data y is orthogonal to columns of A off the support of α_0, ensuring correct term selection. In addition, the update step d_ℓ(I) in (2.7) is equal to sgn(c_ℓ(I)), implying that the sign agreement property holds. Thus, k-step termination is guaranteed.
Second, at the left end of the phase diagram, as δ → 0, we may rely on recent theoretical developments to gauge the behavior of the phase transition curve. In their work on randomly projected polytopes, Donoho and Tanner showed that an upper bound for the value ρ(δ) below which ℓ1 minimization recovers the sparsest solution with high probability asymptotically behaves like 1/(2 log(1/δ)) as δ → 0 [22]. In particular, this same upper bound holds in our case, as failure to recover the sparsest solution necessarily implies failure of the k-step property.
Lastly, and most importantly, we note that the k-step phase transition for the partial orthogonal ensembles is stable with increasing n, in the following sense. For δ ∈ (0, 1] fixed, the range of ρ in which the k-step property holds does not change with n. Evidence for the stability of the phase transition is given in Figure 9, which demonstrates that for δ fixed at 0.5, the phase transition occurs at an approximately constant value of ρ. The implication of this property is that the empirical phase transitions computed here may be used to estimate the k-step behavior of Homotopy for problem instances of any size.
7 Bridging ℓ1 Minimization and OMP

In the introduction, we alluded to the fact that there is undeniable parallelism in results about the ability of ℓ1 minimization and Omp to recover sparse solutions to systems of underdetermined linear equations. Indeed, earlier reports [9, 19, 15, 52, 53, 54] documented instances where, under similar conditions, both ℓ1 minimization and Omp recover the sparsest solution. Yet, these works did not offer any insights as to why the two seemingly disparate techniques should offer similar performance in such cases. This section develops a linkage between ℓ1 minimization and Omp.
As a first step, we clear up some confusion in terminology. Previous work studying the performance of Omp set conditions on the sparsity k under which the algorithm correctly recovers the sparsest solution [19, 52, 54]. Yet, in effect, what was proven is that Omp selects a correct term at each iteration, and terminates with the correct solution after k iterations; we now recognize this as the k-step solution property. We note that the k-step solution property is not, in general, a necessary condition for Omp to correctly recover the sparsest solution; it is clearly sufficient, however. Its usefulness lies in the simplicity of k-step solutions; when the k-step property holds, the solution path is the simplest possible, consisting of k correct steps. Thus, it lends itself naturally to focused analysis.
We now argue that, through a series of simplifications, we can move from ℓ1 minimization to Omp, at each step maintaining the k-step solution property. This was already alluded to in Figure 1, which shows the sequence of steps leading from ℓ1 minimization to Omp; the two links in the sequence are Homotopy and Lars. In the following subsections, we elaborate on each of the bridging elements.
7.1 ℓ1 Minimization → Homotopy

For the phrase "algorithm 𝒜 has the k-step solution property" to make sense, the algorithm must have a stepwise structure, building a solution approximation term by term. Thus, we cannot, in general, speak of ℓ1 minimization as having the k-step solution property, as there are many algorithms for such minimization, including those with no meaningful notion of stepwise approximate solution construction. This is where the Homotopy algorithm fits in, bridging the high-level notion of ℓ1 minimization with the lower-level stepwise structure of Omp. Earlier work [24, 40] has shown that Homotopy is a correct algorithm for solving the ℓ1 minimization problem (P_1). In addition, it builds an approximate solution in a stepwise fashion, thus making the k-step solution property applicable.
7.2 Homotopy → LARS

As noted earlier, the Lars procedure is a simplification of the Homotopy algorithm, achieved by removing the condition for sign agreement of the current solution and the residual correlations. This brings us one step closer to Omp; while Homotopy allows for removal of terms from the active set, both Lars and Omp insert terms, one by one, into the active set, never removing any active elements.
Interestingly, when the k-step property holds, this simplification changes nothing in the resulting solution. To see that, recall that Homotopy has the k-step solution property if and only if it has the correct term selection property and the sign agreement property. Correct term selection is all that is needed by Lars to reach the solution in k steps; after k correct terms are inserted into the active set, the algorithm would terminate with zero residual. The sign agreement property implies that Homotopy would successfully terminate in k steps as well. In summary, when the k-step solution property holds, the solution paths of Homotopy and Lars coincide. In particular, the assertion of Theorem 1 holds for Lars as well.
Corollary 2  Let (A, y) be a problem instance drawn from S(Inc_μ; d, n, k), with k ≤ (μ^{-1} + 1)/2. Then, Lars runs k steps and stops, delivering the solution α_0.
Similar conclusions can be made about the assertions of Empirical Findings 1 and 2.
7.3 LARS → OMP

We already hinted at the fact that Homotopy and Omp share an inherent affinity. We now demonstrate that the two methods are based on the same fundamental procedure: solving a sequence of least-squares problems on an increasingly large subspace, defined by the active columns of A.
More precisely, assume we are at the ℓ-th iteration, with a solution estimate x_{ℓ−1} and an active set I consisting of ℓ−1 terms. Recall the procedure underlying Omp: it appends a new term to the active set by selecting the vector a_j that maximizes the absolute residual correlation, j = arg max_{j ∉ I} |a_j^T (y − A x_{ℓ−1})|, and I := I ∪ {j}. It then projects y onto R(A_I) by solving

    A_I^T A_I x_ℓ(I) = A_I^T y,                                           (7.12)

thus guaranteeing that the residual is orthogonal to the subspace R(A_I). Once y is spanned by the active columns, the algorithm terminates with zero residual.
While Lars was derived from a seemingly disparate perspective, it in fact follows the same basic procedure, replacing (7.12) with a penalized least-squares problem. Recalling the discussion in Section 2, we note that Lars also selects a new term according to the maximum correlation criterion, and then solves

    A_I^T A_I x_ℓ(I) = A_I^T y − λ_ℓ s,                                   (7.13)

where s is a vector of length ℓ, recording the signs of the residual correlations of each term at the point it entered the active set, and λ_ℓ is the correlation magnitude at the breakpoint on the Lars path. Indeed, (7.12) and (7.13) are identical if not for the penalization term on the right-hand side of (7.13).
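To make the shared least-squares machinery explicit, here is a minimal Omp loop built directly around (7.12); it is a sketch of ours, not the implementation used in the experiments. A Lars step would differ only in subtracting λ_ℓ s from the right-hand side, as in (7.13).

    import numpy as np

    def omp(A, y, k_max, tol=1e-10):
        """Orthogonal Matching Pursuit: greedy selection plus least squares on the active set (7.12)."""
        n = A.shape[1]
        active, r = [], y.copy()
        x = np.zeros(n)
        for _ in range(k_max):
            if np.linalg.norm(r) <= tol:
                break
            # Column most correlated with the residual (active columns have ~zero correlation,
            # since the residual is kept orthogonal to their span).
            j = int(np.argmax(np.abs(A.T @ r)))
            active.append(j)
            AI = A[:, active]
            xI, *_ = np.linalg.lstsq(AI, y, rcond=None)   # solves A_I^T A_I x(I) = A_I^T y
            x = np.zeros(n)
            x[active] = xI
            r = y - AI @ xI
        return x, active

    # Tiny usage example on a synthetic sparse system.
    rng = np.random.default_rng(4)
    A = rng.standard_normal((50, 200)); A /= np.linalg.norm(A, axis=0)
    x0 = np.zeros(200); x0[[3, 77, 150]] = [1.0, -2.0, 0.5]
    x_hat, active = omp(A, A @ x0, k_max=10)
    print(sorted(active), np.allclose(x_hat, x0, atol=1e-8))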
As it turns out, this modification does not have an effect on the k-step solution property of the procedure. We offer two examples. First, for the incoherent problem suite S(Inc_μ; d, n, k), the statement of Theorem 1 holds for Omp as well (see, e.g., Corollary 3.6 in [52]). Second, for the random problem suite S(USE; d, n, k), Tropp et al. proved results which can now be reinterpreted as saying that, when k ≤ d/(c log(n)), then with high probability for large n, Omp has the k-step solution property, paralleling the statement of Empirical Finding 1. We believe, in fact, that the bound k ≤ d/(2 log(n)) holds for Omp as well.
To summarize, we exhibited a series of transformations, starting with ℓ1 minimization and ending with greedy pursuit. Each transformation is characterized clearly, and maintains the k-step property of the solution path. We believe this sequence clarifies the initially surprising similarities between results about ℓ1 minimization and Omp.
8 Homotopy and PFP
Polytope Faces Pursuit was introduced as a greedy algorithm to solve (P_1) in the dual space. As pointed out in [43, 42], although the algorithm is derived from a seemingly different viewpoint, it exhibits interesting similarities to Homotopy. To study these in more detail, we first briefly describe the algorithm.
Define the standard-form linear program equivalent to (P_1),

    (P̃_1)    min 1^T x̃  subject to  y = Ã x̃,  x̃ ≥ 0,

where Ã = [A −A]. The equivalence of (P_1) and (P̃_1) is well established; the solutions x̃ of (P̃_1) and x of (P_1) have the correspondence x̃(j) = max{x(j), 0} for 1 ≤ j ≤ n, and x̃(j) = max{−x(j − n), 0} for n + 1 ≤ j ≤ 2n. The Lagrangian dual of the linear program (P̃_1) is given by

    (D̃_1)    max y^T v  subject to  Ã^T v ≤ 1.
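As a concrete, and entirely illustrative, check of this equivalence, the following sketch assembles (P̃_1) explicitly and hands it to a generic LP solver, then recovers the (P_1) solution through the stated correspondence. This is not how Pfp itself operates; Pfp works in the dual, as described next.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(5)
    d, n = 30, 80
    A = rng.standard_normal((d, n)); A /= np.linalg.norm(A, axis=0)
    x_true = np.zeros(n); x_true[[5, 20, 60]] = [2.0, -1.0, 0.5]
    y = A @ x_true

    # (P1~): min 1^T x~  s.t.  y = [A  -A] x~,  x~ >= 0.
    A_tilde = np.hstack([A, -A])
    res = linprog(c=np.ones(2 * n), A_eq=A_tilde, b_eq=y, bounds=(0, None), method="highs")

    # Recover the (P1) solution: x(j) = x~(j) - x~(j + n).
    x_hat = res.x[:n] - res.x[n:]
    print(res.status, np.allclose(x_hat, x_true, atol=1e-6))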
By the strong duality theorem, an optimal solution x̃ of the primal problem (P̃_1) is associated with an optimal solution v to the dual problem (D̃_1), such that 1^T x̃ = y^T v. Pfp operates in the dual space, tracing a solution to the dual problem (D̃_1), which then yields a solution to the primal problem (P̃_1) (or, equivalently, (P_1)). To do so, the algorithm maintains an active set, I, of nonzero coordinates, and carries out a familiar procedure at each iteration: selection of a new term to enter the active set, removal of violating terms from the active set (if any), and computation of a new solution estimate. Specifically, at the ℓ-th stage, Pfp first selects a new term to enter the active set, via the criterion

    i_ℓ = arg max_{i ∉ I} { a_i^T r_ℓ / (1 − a_i^T v_ℓ) : a_i^T r_ℓ > 0 },        (8.14)

with r_ℓ = y − Ã x̃_ℓ the current residual. It then updates the active set, I ← I ∪ {i_ℓ}, and computes a primal solution estimate, the projection of y onto the subspace spanned by the active vectors,

    x̃_{ℓ+1} = (Ã_I^T Ã_I)^{-1} Ã_I^T y,                                          (8.15)

and a dual solution estimate

    v_{ℓ+1} = Ã_I (Ã_I^T Ã_I)^{-1} 1.                                             (8.16)

Finally, the algorithm verifies that all the active coefficients x̃_{ℓ+1}(I) are non-negative; coordinates with negative coefficients are removed from the active set.
The affinities between Homotopy and Pfp are not merely cosmetic. In fact, the two algorithms, derived from seemingly distinct viewpoints, carry out nearly identical procedures. To highlight these similarities, assume that both algorithms are applied to the standard-form linear program (P̃_1). We begin by examining the criteria for selection of a new term to enter the active set. In particular, assume we are at the ℓ-th stage of the Homotopy algorithm. Recalling the discussion in Section 2 above, the Homotopy term selection criterion (when applied to (P̃_1)) is
    i^+ = arg min^+_{i ∉ I} { (λ_ℓ − c_ℓ(i)) / (1 − a_i^T v_ℓ) },                 (8.17)

where v_ℓ = Ã_I d_ℓ(I) = Ã_I (Ã_I^T Ã_I)^{-1} 1 as in (8.16) above, and min^+ indicates that the minimum is taken over positive arguments. Comparing (8.17) with (8.14), we note that the main difference lies in the residual; since Pfp projects the data y onto the space spanned by the active vectors, denoted R_I, its residual is necessarily orthogonal to R_I. Homotopy, in contrast, does not enforce orthogonality, and its residual can be expressed as the sum of a component in R_I and a component in the orthogonal complement R_I^⊥. Formally,

    r_ℓ = P_I r_ℓ + r_ℓ^⊥,

where P_I = Ã_I (Ã_I^T Ã_I)^{-1} Ã_I^T is the orthogonal projector onto R_I, and r_ℓ^⊥ = (I_d − P_I) r_ℓ is the component of r_ℓ in R_I^⊥. The component of r_ℓ in R_I obeys

    P_I r_ℓ = Ã_I (Ã_I^T Ã_I)^{-1} c_ℓ(I) = λ_ℓ v_ℓ.                              (8.18)
Plugging this into (8.17), we get

    i^+ = arg min^+_{i ∉ I} { (λ_ℓ − a_i^T (r_ℓ^⊥ + λ_ℓ v_ℓ)) / (1 − a_i^T v_ℓ) }
        = arg min^+_{i ∉ I} { λ_ℓ − a_i^T r_ℓ^⊥ / (1 − a_i^T v_ℓ) }
        = arg max_{i ∉ I} { a_i^T r_ℓ^⊥ / (1 − a_i^T v_ℓ) : a_i^T r_ℓ^⊥ > 0 },
Figure 11: Phase Diagrams of Pfp for the Problem Suite S(USE,Uniform; d, n, k). Shaded attribute is the proportion of successful executions among total trials, for: (a) k-step solution property; (b) correct term selection property; (c) sign agreement property. Overlaid on each plot are curves for d/(2 log(n)) (solid), k̂_{0.25}^{d,n} (dashed), and the confidence bound for k̂_{0.25}^{d,n} with 95% confidence level (dotted).
where the last equality holds because Ã^T v_ℓ < 1 throughout the Homotopy solution path. Thus, conditional on r_ℓ^{PFP} = r_ℓ^⊥, the Homotopy selection criterion is equivalent to Pfp's selection criterion. Now, since r_ℓ^⊥ = y − P_I y, as long as no elements are removed from the active set, r_ℓ^{PFP} = r_ℓ^⊥ by definition. In summary, the discussion above establishes the following.
Corollary 3 As long as no elements are removed from the active set, the solution path of Pfp
is identical to the Homotopy solution path. In particular, with the sign condition removed,
Pfp is equivalent to Lars.
This finding is interesting in light of the earlier discussion; it implies that when the k-step solution property holds for Homotopy, it also holds for Pfp. In particular, the statement of Theorem 1 holds for Pfp as well. Formally,
Corollary 4  Let (A, y) be a problem instance drawn from S(Inc_μ; d, n, k), with k ≤ (μ^{-1} + 1)/2. Then, Pfp runs k steps and stops, delivering the solution α_0.
Similarly, Empirical Findings 1 and 2 hold for Pfp as well. Figure 11 displays phase diagrams for the k-step solution property (panel (a)), the correct term selection property (panel (b)), and the sign agreement property (panel (c)), when Pfp is applied to instances of the problem suite S(USE,Uniform; d, n, k). The three properties exhibit a clear phase transition at around d/(2 log(n)). Analogously, the statements of Empirical Finding 2 can be verified to hold for Pfp as well; we omit that evidence here for brevity.
9 Recovery of Sparse Solutions
Previous sections examined various properties of Homotopy, all centered around the k-step solution property. The motivation is clear: the Homotopy algorithm is at peak efficiency when it satisfies the k-step solution property. Yet, we noted that failure to terminate in k steps need not imply failure to recover the sparsest solution. We now turn to study this latter property in more detail.
Definition 5  An algorithm 𝒜 has the k-sparse solution property at a given problem instance (A, y) drawn from S(E,V; d, n, k) if, when applied to the data (A, y), the algorithm terminates with the solution α_0.
Earlier papers used a different term for the same concept [18, 15, 55]; they would use the term ℓ0-equivalence to indicate recovery of the sparsest solution, i.e. equivalence to the solution of the combinatorial optimization problem

    (P_0)    min_x ‖x‖_0  subject to  y = Ax.
For the Homotopy algorithm, the k-sparse solution property has been well studied. Indeed, since the Homotopy algorithm is guaranteed to solve (P_1), it will have the k-sparse solution property whenever the solutions to (P_0) and (P_1) coincide. In a series of papers [15, 13, 17], Donoho has shown that for problem instances from the suite S(USE; d, n, k), the equivalence between solutions of (P_0) and (P_1) exhibits a clear phase transition, denoted ρ_W, that can be computed numerically. This phase transition provides a theoretical prediction for the conditions under which Homotopy satisfies the k-sparse solution property. Yet, at present, no such predictions exist for Lars, Omp, or Pfp.
We now present the results of an extensive simulation study, examining the validity of the k-sparse solution property for Homotopy, Lars, Omp, and Pfp, comparing the performance of these four closely related algorithms. The study we conduct closely parallels the experiments described above. In particular, we measure the k-sparse solution property for the four algorithms, applied to problem instances from the suites S(USE,Uniform; d, n, k), S(USE,Gauss; d, n, k) and S(USE,Bernoulli; d, n, k). We generated 100 independent trials at each point on the parameter grid, indexed by δ = d/n and ρ = k/d, with δ ranging through 40 equispaced points in [0.1, 1], ρ ranging through 40 equispaced points in [0.05, 1], and n fixed at 1000. The resulting phase diagrams appear in Figure 12. Parallel results summarizing a similar experiment done for the URPE are presented in Figure 13. Preliminary inspection of these phase diagrams reveals that the suggestive similarities between the four algorithms indeed translate into comparable performance in recovering the sparsest solution. More careful examination prompts several important observations.
Similarity of Lars and Homotopy. Throughout all the displayed phase diagrams, the empirical phase transition of Lars closely follows the transition of Homotopy. In other words, almost any time ℓ1 minimization successfully recovers the sparsest solution, Lars succeeds in finding it as well. This is a rather surprising finding, as, at least in the USE case, for k > d/(2 log(n)), Lars is bound to err and add elements not in the support of α_0 into its active set. The evidence we present here suggests that inside the ℓ1-ℓ0 equivalence phase, Lars still correctly recovers the support of α_0 and obtains the sparsest solution.
Universality of Lars and Homotopy. Examining Figure 12, we note that for the suite S(USE; d, n, k), the performance of Lars and Homotopy is independent of the underlying distribution of the nonzero coefficients. Universality does not hold for Omp, whose performance is affected by the distribution of the nonzeros. In particular, Omp's worst performance is recorded when the underlying nonzeros have constant amplitude and a random sign pattern; this has been noted previously in the analysis of Omp [15, 52].
Increased Performance with S(URPE; d, n, k). Comparison of Figure 13 with Figure 12 reveals that, in general, all four algorithms perform better when applied to problem instances from the suite S(URPE; d, n, k). Universality, however, no longer holds for this problem suite. In particular, Homotopy and Lars show a marked increase in performance when applied to the suite S(URPE,Uniform; d, n, k). In our experiments, we observed similar phenomena when applying the four algorithms to problems drawn from the suites S(PFE; d, n, k) and S(PHE; d, n, k) as well. This is indeed an important finding, as these suites are used extensively in applications.
10 Stylized Applications
Earlier, we gave theoretical and empirical evidence showing that in certain problem scenarios, when sufficient sparsity is present, the Homotopy path and the Lars path agree. We now present examples inspired by concrete applications in NMR spectroscopy and MR imaging that demonstrate the performance of the two algorithms in solving sparse approximation problems. In accord with earlier sections, we shall compare and contrast the results with those obtained by Omp.
10.1 Spectroscopy
Nuclear Magnetic Resonance (NMR) spectroscopy is a widely used method of structure determination in modern chemistry; see [35]. In its essence, NMR spectroscopy is based on the behavior of atomic nuclei in a magnetic field. Thus, an NMR spectrometer records the resonance frequencies of nuclei when exposed to a strong magnetic field and irradiated with an RF pulse. The resulting signal is a superposition of all excited frequencies. For a caricature of 1-D NMR spectra, consider the signal Bumps, depicted in Figure 14(a). It is a synthetic signal, made up of a few peaks at varying amplitudes. Figure 14(b) shows its wavelet expansion on a scale-by-scale basis. Clearly, Bumps has relatively few significant wavelet coefficients, and it admits a sparse representation in a wavelet basis.
As a first example, we consider reconstruction of NMR spectra from partial frequency information [46, 47]. In detail, let x be an NMR signal of length n, which admits sparse synthesis in a wavelet basis, i.e. α = Wx is sparse, with W a wavelet analysis operator. Let Φ be a d × n matrix drawn from the partial Fourier ensemble, i.e. constructed by selecting d rows of an n × n Fourier matrix at random. At the sampling stage, we compute y = Φx; in words, y is computed by sampling d frequencies of x at random. At the reconstruction stage, we solve:

    (CS_1)    min ‖α‖_1  subject to  y = Φ W^T α,

where we used the fact that W is an orthogonal operator, whence x = W^T α. Readers familiar with the notion of Compressed Sensing (CS) [14, 56] will recognize (CS_1) as a particular case of a Compressed Sensing problem. Indeed, work by Donoho [14] has shown that when the signal x is compressible (i.e. has a sparse representation in a certain basis), the solution of (CS_1) is faithful to the original signal x.
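A minimal sketch of the sampling stage is given below. It assumes the PyWavelets package for the operator W, uses a smooth random walk as a stand-in for the Bumps signal, and builds the combined operator ΦW^T appearing in (CS_1); the ℓ1 solve itself (via Homotopy, Lars, Omp, or Pfp, e.g. using the routines in SparseLab) is omitted.

    import numpy as np
    import pywt

    rng = np.random.default_rng(6)
    n, d = 1024, 512
    x = np.cumsum(rng.standard_normal(n)) / 30.0          # stand-in for the Bumps signal

    # W: orthogonal wavelet analysis operator (a Daubechies filter via PyWavelets).
    wavelet, level = "db4", 6
    coeffs = pywt.wavedec(x, wavelet, mode="periodization", level=level)
    alpha, slices = pywt.coeffs_to_array(coeffs)          # alpha = W x (length n)

    # Phi: d random rows of the n x n (unitary) Fourier matrix, applied via the FFT.
    rows = np.sort(rng.choice(n, size=d, replace=False))
    def Phi(v):
        return np.fft.fft(v, norm="ortho")[rows]

    # Sampling stage y = Phi x; reconstruction would solve (CS_1): min ||alpha||_1 s.t. y = Phi W^T alpha.
    def Phi_WT(a):
        coeffs_a = pywt.array_to_coeffs(a, slices, output_format="wavedec")
        return Phi(pywt.waverec(coeffs_a, wavelet, mode="periodization"))

    y = Phi(x)
    print(y.shape, np.allclose(Phi_WT(alpha), y))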
For the purposes of this simulation, the signal Bumps was digitized to n = 1024 samples. We took d = 512 random frequency measurements, and applied the Homotopy algorithm to solve (CS_1), as well as Lars, Omp, and Pfp. Figure 15 summarizes the results at several stages of execution, and Table 5 has corresponding error and timing measures. The results indicate that all four algorithms faithfully recovered the signal, at roughly the same error. Of the four algorithms, Omp was fastest, though only marginally, compared to Homotopy and Lars. We also note that, owing to the rapid coefficient decay of the wavelet expansion of Bumps, already after 200 iterations we obtain accurate reconstructions, at a fraction of the time it takes the execution to complete. This suggests that, in practice, if we have knowledge of the sparsity structure of the solution, we may enforce early termination without compromising the quality of reconstruction. As the example in the next section illustrates, the savings are particularly significant in large-scale applications.
Figure 12: The ℓ0 Equivalence Property for the Suite S(USE,V; d, n, k). Each panel has plots of the empirical phase transition curves of Homotopy (dashed), Lars (dash-dotted), Omp (dotted) and Pfp (solid) alongside the theoretical curve ρ_W, on a δ-ρ grid (δ = d/n, ρ = k/d), with n = 1000, for: (a) V = Uniform (U[0,1] nonzeros); (b) V = Gauss (N[0,1] nonzeros); (c) V = Bernoulli (random ±1 nonzeros).
Figure 13: The ℓ0 Equivalence Property for the Suite S(URPE,V; d, n, k). Each panel has plots of the empirical phase transition curves of Homotopy (dashed), Lars (dash-dotted), Omp (dotted) and Pfp (solid) alongside the theoretical curve ρ_W, on a δ-ρ grid (δ = d/n, ρ = k/d), with n = 1000, for: (a) V = Uniform (U[0,1] nonzeros); (b) V = Gauss (N[0,1] nonzeros); (c) V = Bernoulli (random ±1 nonzeros).
Figure 14: (a) Signal Bumps, with n = 1024 samples; (b) its wavelet expansion.
Figure 15: CS reconstruction of Bumps. Left to right: solutions of Homotopy, Lars, Omp, and Pfp; top to bottom: solutions after 100 iterations, after 200 iterations, and at termination.
10.2 MR Imaging
Magnetic Resonance Imaging (MRI) is a technique based on the same principles as NMR spectroscopy. It is used primarily in medical settings to produce high quality images of internal organs in the human body. Unlike NMR, here we are interested in the spatial location of protons exposed to a magnetic field. Acquisition of an MR image involves adapting the gradients of the external magnetic field to get a spatial frequency map of the exposed organ. As MRI
Algorithm     After 100 iters.        After 200 iters.        Final result
              relerr_2    Time        relerr_2    Time        relerr_2    Time
Homotopy      0.280       0.6 secs    0.105       1.5 secs    0.052       12.4 secs
Lars          0.280       0.5 secs    0.105       1.5 secs    0.049       10.7 secs
Omp           0.120       0.4 secs    0.021       1.3 secs    0.017       7.0 secs
Pfp           0.154       1.5 secs    0.053       3.7 secs    0.050       19.5 secs

Table 5: Error measures (relative ℓ2 error) and execution times for CS reconstruction of Bumps with Homotopy, Lars, Omp, and Pfp.
often deals with enormous volumes of data, acquisition time is a key factor in the success of this technique. Thus, the ability to reconstruct a faithful image from partial frequency information is of great benefit.
The scenario we consider in this section parallels the spectroscopy example, with some notable differences. Most importantly, rather than acquiring 1-D spectra, here we need to map a 3-D volume corresponding to a certain part of the human body. This implies that a huge mass of data needs to be sampled and subsequently reconstructed. Thus, speed becomes a crucial factor; computationally intensive methods which are acceptable choices for 1-D reconstruction are rendered useless in the 3-D case. In an attempt to reduce acquisition time and speed up computations, traditional methods employ frequency sampling on coarser grids, or only at low frequencies; the results are often blurred or aliased. Here we shall follow the Compressed Sensing paradigm and sample Fourier space at random, with subsequent reconstruction via ℓ1 minimization.
In detail, we treat the 3-D volume as an array of 2-D slices. Let x_j denote one such slice, column-stacked into a vector of length n. Just as we did in the 1-D case, at the sampling stage we collect data y_j = Φ x_j, where Φ is a d × n matrix drawn from the partial Fourier ensemble. In words, we sample d frequencies of x_j at random. Note that in practice, this frequency sampling is done physically by the MRI scanner; this operation is simulated here with a multiplication by Φ. At the reconstruction stage, we solve (CS_1), with the wavelet operator W replaced by its 2-D counterpart. We do so for each slice x_j, until the entire 3-D volume is reconstructed.
In our simulations, we used real data provided to us courtesy of Neal Bangerter of the MRSRL Lab at Stanford University; see [1] for more details. It is a 3-D volume of a part of a human leg, containing 230 slices of dimensions 128×128. Figure 16, panels (a), (d) and (g), shows X- and Y-axis projections of the data along with one vertical slice. Clearly, computing the full Homotopy solution would be prohibitively time-consuming, considering the data size. Hence, to approximately solve (CS_1) at the reconstruction stage, we performed only 800 iterations of Lars. For comparative purposes, we also performed 800 iterations of Omp. The results are depicted in the two rightmost columns of Figure 16. Indeed, we see that 800 iterations were sufficient to obtain a visually faithful result. In particular, all the major arteries appearing in the original data are also present in the reconstructed data. Comparison of the reconstructions obtained by Lars and Omp reveals that the reconstruction result of Omp tends to suffer from streaking artifacts, hindering the readability of the final result. Thus, qualitatively, Lars produced a better reconstruction.
10.3 Component Separation in a Noisy Environment
In Section 2, we commented that Homotopy may be easily adapted to treat noisy data, by
terminating at a prescribed point on the Homotopy path. We now illustrate the use of this
Figure 16: CS reconstruction of 3-D MRI data. Left column: X-axis and Y-axis MIPs of the original data, and one vertical slice; middle column: corresponding Lars output at 800 iterations; right column: corresponding Omp output at 800 iterations.
strategy, in the context of component separation. In the particular example we present here,
we consider the problem of separating a signal contaminated with white noise into its har-
monic and impulsive components. This relatively simple problem, explored in [9, 20], inspired
more ambitious applications such as texture-geometry separation [48] and astronomical image
representation [49].
In detail, let y_0 be a signal of length d, composed of a few sinusoids and a few spikes, with a total of k such components, k ≪ d. Note that y_0 may be written as y_0 = A α_0, with A = [I F] a d × 2d matrix representing an overcomplete dictionary composed of the d × d identity matrix I and the d × d Fourier matrix F. α_0 is then a 2d-coefficient vector, with ‖α_0‖_0 = k. In our noisy setting, y_0 is not directly observable for measurement. Instead, we measure data y = y_0 + z, with the noise power ‖z‖_2 ≤ ε_n. To solve this underdetermined system for the time and frequency components, we apply the Homotopy algorithm, stopping when the residual gets below the noise level ε_n.
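The stopping rule is easy to emulate. The sketch below is a real-valued analogue of this experiment (an orthonormal DCT stands in for the Fourier matrix F, the sparse object is drawn at random, and a plain greedy projection loop stands in for the Homotopy path); the only point being illustrated is termination once the residual norm drops below ε_n.

    import numpy as np
    from scipy.fft import dct

    rng = np.random.default_rng(7)
    d = 512
    F = dct(np.eye(d), norm="ortho", axis=0)               # orthonormal "frequency" basis (DCT stand-in for F)
    A = np.hstack([np.eye(d), F])                           # overcomplete dictionary [I  F], d x 2d

    # k-sparse time-frequency object, observed in noise with SNR = 3.
    alpha0 = np.zeros(2 * d)
    alpha0[rng.choice(2 * d, size=42, replace=False)] = rng.standard_normal(42)
    y0 = A @ alpha0
    z = rng.standard_normal(d)
    z *= (np.linalg.norm(y0) / np.sqrt(3)) / np.linalg.norm(z)
    y = y0 + z
    eps_n = np.linalg.norm(z)

    # Greedy stepwise loop with the noise-aware stopping rule: stop once ||r||_2 <= eps_n.
    active, r = [], y.copy()
    while np.linalg.norm(r) > eps_n:
        active.append(int(np.argmax(np.abs(A.T @ r))))
        xI, *_ = np.linalg.lstsq(A[:, active], y, rcond=None)
        r = y - A[:, active] @ xI
    print(len(active), np.linalg.norm(r) <= eps_n)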
Figure 17 presents the results of this simulation. Panel (a) displays a signal y_0 of length d = 512, composed of 2 harmonic terms superposed with 40 spikes, with amplitudes distributed normally. These are plotted in panels (b) and (c). Panel (d) shows the observed data, buried in white Gaussian noise, with a signal-to-noise ratio of 3. Thus, the noise power is one third the signal power. Panels (e)-(h) on the bottom row of Figure 17 show the corresponding output of the Homotopy algorithm, applied to the data y. As evident from the figure, Homotopy manages to recover most of the components of α_0, leaving mostly noise in the residual. The
Figure 17: Time-frequency separation in the presence of noise. Top row: the original signal, its time and frequency components, and the noisy signal with SNR = 3. Bottom row: corresponding output of Homotopy.
relative reconstruction error is ‖α̂ − α_0‖_2 / ‖α_0‖_2 = 0.19. In summary, as this example illustrates, Homotopy may be straightforwardly adapted to deal with noisy data, to successfully recover sparse solutions of underdetermined systems.
11 Accompanying Software
SparseLab is a library of Matlab functions we make available on the web at http://sparselab.stanford.edu. It has been developed in the spirit of reproducible research [4, 10], namely the idea that publication of computational research also demands publication of the complete software environment needed for reproducing that research. SparseLab has been used by the authors to create the figures and tables used in this article, and the toolbox contains scripts which will reproduce all the calculations of this paper. The current release includes about 400 Matlab files, datasets, and demonstration scripts.
12 On Approximate ℓ1 Minimization Algorithms

In the belief that true ℓ1 minimization is far too slow for large-scale applications, numerous authors have explored approximate algorithms based on iterative thresholding or similar ideas; examples include [23, 11, 48, 49, 25]; related ideas include projection onto convex sets [6]. The reader should keep in mind that there are applications where such methods indeed operate faster than the methods discussed here, while still producing a faithful approximate solution; see, for example, Section 9 of [23]. On the other hand, the present methods for ℓ1 minimization are much faster than general-purpose optimizers, so much of the apparent advantage claimed for approximate ℓ1 schemes has been neutralized by the algorithms studied here. Furthermore, approximate methods inevitably trade off aptitude for speed; in general, the range of sparsities in which such methods reliably recover the sparsest solution is considerably limited, compared to ℓ1 minimization. A further advantage of the Homotopy viewpoint is the stimulus it provides for future research: the empirical phenomena identified here may lead to deep research in polytope theory, perhaps comparable to the kinds of results in [22].
13 Conclusions
In this paper, we study the Homotopy algorithm, introduced by Osborne et al. [40] and Efron
et al. [24]. It was originally proposed in the context of overdetermined noisy systems. We
propose here to use Homotopy to obtain sparse solutions of underdetermined systems of linear
equations. Homotopy follows a piecewise linear solution path, starting from the zero solution,
and terminating at the solution of (P
1
). It maintains an active set of column vectors of the
matrix A, adding or removing a vector at each step, and updating the solution accordingly. It
is a particularly eective algorithm for solving
1
minimization problems when the underlying
solution is sparse, and the underlying system is incoherent or random. In such scenarios, it signif-
icantly outperforms state-of-the-art general LP solvers. Moreover, its computational complexity
is then comparable to greedy stepwise algorithms such as Omp. In fact, the similarity between
the Homotopy, Lars, and Omp algorithms illuminates the parallelism evident in results on
the performance of
1
minimization and Omp in recovering sparse solutions to underdetermined
linear systems.
We made such claims concrete by formally showing that for two matrix classes, deterministic incoherent matrices and random uniform spherical matrices, Homotopy has the following k-step solution property: if the data admit a sparse representation using k columns of A, Homotopy finds it in exactly k steps. We proved that for problems drawn from the suite S(Inc_μ; d, n, k), Homotopy has the k-step solution property whenever k < (μ^{-1} + 1)/2. We also observed empirically that for instances of the problem suite S(USE; d, n, k), Homotopy has the k-step solution property with high probability when k is less than a threshold, empirically k < d/(2 log(n)) · (1 − ε_n).
Using the notions of phase diagrams and phase transitions, we presented empirical evidence supporting these assertions. We showed that for problem instances drawn from S(USE; d, n, k), the k-step solution property of Homotopy exhibits an empirical phase transition at d/(2 log(n)), regardless of the distribution of the nonzeros in the solution. Simulations with the suite S(RSE; d, n, k) resulted in similar findings. For the PFE and PHE, we noted that Homotopy has the k-step solution property at a much wider range of sparsities, compared with the USE. Thus, in many applications involving PFE and PHE matrices, Homotopy would be particularly efficient in finding the sparsest solution. Similar conclusions were made for the problem suite S(URPE; d, n, k).
Comparison of the performance of Homotopy, Lars, Omp, and Pfp in recovering sparse solutions led to several interesting findings. First, we observed that Lars, an approximation to Homotopy that only adds terms to its active set, performed remarkably similarly to Homotopy. Second, both algorithms exhibited an increase in performance when applied to problem instances from the suites S(URPE; d, n, k), S(PFE; d, n, k) and S(PHE; d, n, k). The performance of Omp was comparable in some cases, yet it was highly affected by the underlying distribution of the nonzero coefficients. In particular, when the coefficients follow a random sign pattern with constant amplitude, Omp's region of successful recovery was significantly smaller than that of Homotopy or Lars.
Homotopy extends naturally to deal with noisy data, and produces satisfactory results
even when it is stopped before completing the solution path. Examples inspired by applications
in NMR spectroscopy and MR imaging were given to demonstrate its usefulness in practical
scenarios. These examples and all the simulations in this paper are reproducible with the
accompanying software.
Note
At one point, we made an effort to include Iddo Drori in this project. Drori has since presented simulation results related to some of our work in Section 5 above in a talk at the 2006 ICASSP meeting.