Fundamentals of Cone Regression

Mariella Dimiccoli
Contents
1 Introduction
2 Statement of the problem, basic notations and basic facts
 2.1 Convex quadratic programming (CQP) formulation
  2.1.1 Primal formulation
  2.1.2 Dual formulation
 2.2 Linear complementarity problem (LCP) formulation
 2.3 Proximal formulation
3 State of the art
 3.1 Algorithms with asymptotic convergence
  3.1.1 Least squares in a product space (LSPS)
  3.1.2 Uzawa method
  3.1.3 Hildreth's algorithm
  3.1.4 Primal-dual interior point methods
  3.1.5 Dykstra's algorithm
1. Introduction
offers a good basis to discuss the fundamentals of cone regression and related
numerical issues.
The problem of concave regression is to estimate a regression function subject to concavity constraints represented by a set of linear inequalities. Brought to the attention of the scientific community by micro-economists interested in estimating production functions [1, 2, 3], the problem of concave regression arises not only in microeconomics (indirect utility, production or cost functions, Laffer curve) but also in medicine (dose-response experiments) and biology (growth curves, hazard and failure rate in survival analysis). First addressed by Hildreth in 1954 [1], the search for efficient methods for solving large concave regression problems is still an open issue. This may appear quite surprising considering the noticeable advances in convex optimization since then, but it can be understood when considering that most efforts have been devoted to theoretical issues such as generalizations and convergence, while comparatively little attention has been paid to efficiency and numerical performance in practice [4, 5, 6].
In this paper, we formulate the cone regression problem using different optimization approaches. We highlight similarities and differences between the various algorithms reviewed, and we propose several improvements to enhance stability and to bound the computational cost. We estimate the expected performance of the available algorithms, establishing in particular which is the most competitive technique for solving large instances of the problem. Finally, in the light of this study, we give recommendations for further research.
In section 2, we formally state the problem of cone regression and introduce some basic notation and results that will be used throughout the paper. In section 3, we survey the state of the art, distinguishing between the class of algorithms with asymptotic convergence and the class of algorithms with finite-time convergence. In section 6, we compare performances numerically and finally, in section 7, we draw some concluding remarks.
$$y = f(z) + \epsilon \qquad (2.1)$$
Inference about the response function may be drawn by assuming that the function f(z) can be approximated by some given algebraic form with several unknown parameters to be estimated from the data. However, the difficulty with this procedure is that the inferences often depend critically upon the algebraic form chosen. Alternatively, one may know some properties of the relation being studied but not have sufficient information to put the relation into any simple parametric form. In this case, a nonparametric approach is more appropriate.
Let x_i be the expected value of the output at input level z_i:
$$x_i = f(z_i), \qquad i = 1, \dots, n \qquad (2.3)$$
Estimates of x_i can be derived by the method of maximum likelihood, by the method of least squares, or by other formulations. If there were no a priori restriction on f, then the maximum likelihood estimate of x_i would just be the mean of the observed outputs at input level z_i, that is
$$\tilde{x}_i = \frac{1}{T_i} \sum_{t=1}^{T_i} y_{it}, \qquad i = 1, \dots, n \qquad (2.4)$$
Instead, since in the cone regression problem the known properties of the regression function f can be expressed by a set of linear inequalities, the maximum likelihood estimates are obtained by maximizing the likelihood function subject to these linear inequality constraints.
Formally, given a dataset of n dependent variables represented by the vectors
w, y ∈ Rn , corresponding to the independent variable values z1 < z2 < ... < zn ,
the problem of cone regression is to estimate the closest function to the dataset
via a least squares regression subject to a set of linear inequality constraints by
solving
$$\hat{x} = \operatorname*{argmin}_{\{Ax \le 0\}} \; \|x - y\|_{2,w}^2 \qquad (2.5)$$
with $\|x - y\|_{2,w}^2 = \sum_{i=1}^{n} w_i (y_i - x_i)^2$.
Denoting by K_i the set of vectors that satisfy the i-th linear inequality constraint, each K_i ≠ ∅ is a closed convex set in R^n and the feasibility set K can be written as the nonempty intersection of the family of closed subsets K_i ⊂ R^n. Being the intersection of closed convex sets, K is also a closed convex set. More precisely, since each K_i is a half-space containing the origin, the feasibility set K is a convex polyhedral cone. In matrix form, K can be written as K = {x : Ax ≤ 0}. In the case of concave regression, A ∈ R^{m×n} with m = n − 2 is a matrix such that each row A_i represents a concave two-piece linear function with a negative second difference at x_{i+1} only, and the linear inequalities are as follows:
$$\frac{x_{i+2} - x_{i+1}}{z_{i+2} - z_{i+1}} - \frac{x_{i+1} - x_i}{z_{i+1} - z_i} \le 0, \qquad i = 1, \dots, n-2 \qquad (2.6)$$
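To make the constraint set concrete, the sketch below builds the matrix A of (2.6) for given design points; the NumPy implementation and the function name are illustrative choices, not code from the paper.

```python
import numpy as np

def concave_constraint_matrix(z):
    """Build A in R^{(n-2) x n} so that A x <= 0 encodes the
    concavity constraints (2.6) for design points z_1 < ... < z_n."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    A = np.zeros((n - 2, n))
    for i in range(n - 2):
        d1 = z[i + 1] - z[i]        # z_{i+1} - z_i
        d2 = z[i + 2] - z[i + 1]    # z_{i+2} - z_{i+1}
        A[i, i] = 1.0 / d1
        A[i, i + 1] = -(1.0 / d1 + 1.0 / d2)
        A[i, i + 2] = 1.0 / d2
    return A

# For equally spaced z, each row is proportional to (1, -2, 1).
```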
In the following, we give alternative formulations of the cone regression prob-
lem that rest on optimization theory.
The problem (2.5) is to find the point x̂ in the cone K that is closest to y. The solution is the orthogonal projection of y onto K, written Π(y|K), with respect to the metric ‖·‖_{2,w} represented by the symmetric positive definite matrix W:
$$\hat{x} = \Pi(y|K) = \operatorname*{argmin}_{\{Ax \le 0\}} \; (y - x)^T W (y - x) \qquad (2.7)$$
For the problem (2.7), the matrix W is diagonal with element w_i on the diagonal. In practice, if y_i is the mean value measured at z_i, then w_i corresponds to the size of the sample at z_i. Since K is a closed, convex and nonempty set in the Hilbert space R^n, it is a Chebyshev set, that is, the projection exists and is unique.
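For small instances, formulation (2.7) can be handed directly to a generic convex solver. The sketch below assumes the cvxpy modeling package, which is not used in the paper, and is meant only as a reference implementation of the primal problem.

```python
import cvxpy as cp

def cone_regression_qp(y, w, A):
    """Weighted least squares projection of y onto K = {x : A x <= 0}, problem (2.7)."""
    n = len(y)
    x = cp.Variable(n)
    objective = cp.Minimize(cp.sum(cp.multiply(w, cp.square(x - y))))
    problem = cp.Problem(objective, [A @ x <= 0])
    problem.solve()
    return x.value
```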
The dual formulation of problem (2.7) rests on the Moreau decomposition theorem [7], which generalizes, in convex analysis, the orthogonal projection theorem for vector subspaces. Central to the Moreau decomposition theorem is the definition of the polar cone of a given convex cone K, which is given below.

Definition 1. The polar cone K^o of any convex cone K is given by
$$K^o = \{x \in \mathbb{R}^n : \langle k, x \rangle \le 0 \;\; \forall k \in K\} \qquad (2.8)$$
Fig 1. The polar cone K^o of a given convex cone K ⊂ R^2 is the set of all vectors whose scalar product with every vector of K is non-positive. The data point y can be written as the sum of x̂, its projection onto the cone K, and x̂^o, its projection onto the polar cone K^o.
Denoting by λ̂ the solution to the dual problem, the solution to the primal prob-
lem is x̂ = y − AT λ̂.
As can be observed, in the dual formulation each element of the vector λ must satisfy a single positivity constraint. Goldman [9] noticed that the dual problem can also be viewed as a minimum distance problem in the same parameter space as the primal problem.
The CQP (2.7) can be solved by using a proximity operator [10, 11]. Proximity
operators are used to solve problems of the form
$$\operatorname*{argmin}_{x \in \mathbb{R}^n} \; f_1(x) + f_2(x) + \dots + f_m(x) \qquad (2.17)$$
where f1 , f2 , ..., fm are convex functions from Rn to ] − ∞, +∞], that are not
necessarily differentiable. Each fi is treated through its proximity operator,
which is defined as follows.
Definition 3. Let Γ_0(R^n) be the class of lower semicontinuous convex functions from R^n to ]−∞, +∞] whose domain, denoted dom(f), is not the empty set. Let f ∈ Γ_0(R^n); then the proximity operator of f is the map prox_f : R^n → R^n such that
$$\forall y \in \mathbb{R}^n, \quad \operatorname{prox}_f(y) = \operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + \frac{1}{2}\|x - y\|^2 \qquad (2.18)$$
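In the cone regression setting the relevant functions are indicators of half-spaces, whose proximity operator is simply the Euclidean projection onto the half-space; a minimal sketch, with a hypothetical function name, is given below.

```python
import numpy as np

def prox_halfspace_indicator(z, a):
    """Prox of the indicator of {x : a^T x <= 0}, i.e. the projection of z onto that half-space."""
    t = np.dot(a, z)
    if t <= 0:          # already feasible: the prox leaves z unchanged
        return z
    return z - (t / np.dot(a, a)) * a
```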
$$\partial f(p) = \{u \in \mathbb{R}^n : \forall y \in \mathbb{R}^n, \; (y - p)^T u + f(p) \le f(y)\} \qquad (2.20)$$
$$\hat{x}^o = \operatorname*{argmin}_{x \in \bigcap_{i=1}^{m} K_i^o} \|x - y\|^2 = \operatorname*{argmin}_{x \in \mathbb{R}^n} \Big\{ \|x - y\|^2 + \sum_{i=1}^{m} \imath_{K_i^o}(x) \Big\} \qquad (2.23)$$
$$\phantom{\hat{x}^o} = \operatorname{prox}_{\sum_{i=1}^{m} \imath_{K_i^o}}\, y \qquad (2.24)$$
In this section, we review existing algorithms for solving the cone regression
problem. The algorithmic approaches, and in turn their numerical performances,
strongly depend on the choice of the problem formulation. All existing methods
are iterative and they can attain the optimal solution since K is a Chebyshev
set and therefore the optimal solution must exist in this closed set. However,
in terms of their numerical performance they can be classified into two broad classes: the class of methods that attain the optimal solution only asymptotically (exactly only in very simple cases) [1, 12], and the class of methods that converge to the optimal solution in a finite number of steps [13, 14, 15, 16, 17, 18, 19, 20]. As will be clarified in the following, methods with asymptotic convergence rest on the properties of the sub-gradient, or more generally of proximity operators, and act by finding the solution as the limit of a sequence of successive approximations. They are typically derived from the primal, the dual or the proximal formulation.
Methods with finite-time convergence exploit the geometric properties of poly-
hedral convex cones and find the exact solution as a non-negative linear combi-
nation of functions, forming a basis in a specified finite dimensional space. They
are typically derived from the linear complementarity problem formulation.
This section includes algorithms based on the primal formulation, such as Least Squares in a Product Space (section 3.1.1); algorithms based on the dual formulation, such as Uzawa's method (section 3.1.2) and Hildreth's method (section 3.1.3); algorithms that solve the dual problem simultaneously with the primal problem, such as Dykstra's alternating projection method (section 3.1.5); and algorithms based on the proximal formulation, such as the Alternating Direction Method of Multipliers (section 3.1.6).
Since the Euclidean space R^n equipped with the dot product is a Hilbert space H, the problem (2.7) can be recast in the m-fold Cartesian product of H, say H^m [35]. Let K^m be the Cartesian product of the sets (K_i)_{i∈I}, i.e., the closed convex set K^m = ×_{i∈I} K_i = {x ∈ H^m : ∀i ∈ I, x_i ∈ K_i}, and let D be the diagonal vector subspace, i.e., D = {(x, ..., x) ∈ H^m : x ∈ H}. Then the CQP (2.7) is equivalent to
$$\operatorname*{argmin}_{x \in K^m \cap D} \; \|x - \bar{y}\|_{2,w}^2 \qquad (3.1)$$
where ȳ = (y, ..., y). Using this strategy, the problem of projecting onto the intersection of m convex sets is reduced to the problem of projecting, in a higher dimensional space, onto only two convex sets, one of which is a simple vector subspace. Geometrically, this amounts to finding the point of D at minimum distance from K^m, which can be obtained iteratively by alternating projections onto K^m and D. The advantage of this strategy is that it speeds up convergence, since a relaxation interval bigger than 2 (the upper bound for Fejér sequences [36]) can be allowed.
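To make the product-space construction concrete, the sketch below (with hypothetical helper names) shows the only two projections the reformulation requires: the projection onto K^m, which splits into independent half-space projections, and the projection onto the diagonal subspace D, which is simply an average. A best-approximation scheme then alternates between these two maps.

```python
import numpy as np

def project_onto_Km(X, A):
    """Project each row X[i] (a copy of the current iterate) onto the
    half-space K_i = {x : A_i x <= 0}; this is the projection onto K^m."""
    P = X.copy()
    for i, a in enumerate(A):
        t = a @ X[i]
        if t > 0:
            P[i] = X[i] - (t / (a @ a)) * a
    return P

def project_onto_D(X):
    """Project onto the diagonal subspace D = {(x, ..., x)}: replace every
    row by the average of the rows."""
    mean = X.mean(axis=0)
    return np.tile(mean, (X.shape[0], 1))
```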
Hildreth [1] proposed to apply the Gauss–Seidel algorithm [21] to the dual problem (2.9). A single cycle of Hildreth's algorithm consists in updating each element of λ sequentially in an arbitrary but fixed order. Each cycle therefore consists of m steps, each of which corresponds to a projection onto the cone K_i, i = 1, ..., m. The algorithm gives rise to a sequence of points, each one differing from the preceding one in exactly one coordinate. At cycle k+1, the updated value λ_i^{k+1} is used in the estimation of λ_{i+1}^{k+1}, so that the best available estimates are used for computing each variable. The convergence of the Gauss–Seidel algorithm is guaranteed only if the matrix A has full row rank, so that there are no redundancies among the inequality restrictions, and it is guaranteed independently of the initial point λ^0 only if A is positive definite and symmetric. The algorithm is sensitive to the normalization as well as to the order of the projections.
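A minimal sketch of Hildreth's cyclic update on the dual, written for the unweighted case W = I so that x̂ = y − A^T λ̂; the fixed number of cycles is an illustrative assumption.

```python
import numpy as np

def hildreth(y, A, n_cycles=1000):
    """Cyclic coordinate ascent on the dual of min ||x - y||^2 s.t. A x <= 0."""
    m = A.shape[0]
    G = A @ A.T              # dual Hessian
    b = A @ y                # dual linear term
    lam = np.zeros(m)
    for _ in range(n_cycles):
        for i in range(m):
            # maximize the dual over lambda_i with the other coordinates fixed,
            # then clip at zero to keep lambda feasible
            lam[i] = max(0.0, lam[i] + (b[i] - G[i] @ lam) / G[i, i])
    return y - A.T @ lam     # primal solution x = y - A^T lambda
```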
If each K_i is a subspace, then at each new cycle k+1 the residual −R_i^k of the projection onto each convex cone K_i is simply the projection of x^k onto the cone K_i^o. Therefore, for subspaces, Dykstra's procedure reduces exactly to the cyclic, iterated projections of von Neumann. In this case, for k → ∞, the sum of the residuals over the cones K_i approximates the projection of y onto the polar cone K^o and therefore, by the Moreau decomposition theorem, x^k approximates the projection onto K. However, if the K_i are not subspaces, Π(·|K_i) is not a linear operator and the von Neumann algorithm does not necessarily converge.
Dykstra's algorithm can also be interpreted as a variation of the Douglas–Rachford splitting method applied to the dual proximal formulation (2.23).
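A minimal sketch of Dykstra's cyclic scheme for the half-space case (unweighted), keeping one correction vector per constraint; the fixed number of cycles is an illustrative assumption.

```python
import numpy as np

def dykstra(y, A, n_cycles=1000):
    """Dykstra's alternating projections onto the half-spaces {x : A_i x <= 0}."""
    m = A.shape[0]
    x = y.copy()
    E = np.zeros((m, len(y)))            # one correction (residual) per constraint
    for _ in range(n_cycles):
        for i in range(m):
            v = x + E[i]                 # add back the correction of constraint i
            a = A[i]
            t = a @ v
            x = v - (max(t, 0.0) / (a @ a)) * a   # project onto the i-th half-space
            E[i] = v - x                 # store the new correction
    return x
```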
The seminal works of Hildreth and Dykstra have inspired many studies, mostly devoted to theoretical investigations of their behavior in a Hilbert space [23, 24], of their convergence [25, 26], of their relation to other methods [27, 28] and of their interpretation in more general frameworks, such as proximal splitting methods [29]. Han [30], as well as Iusem and De Pierro [25], showed that in the polyhedral case the method of Dykstra becomes Hildreth's algorithm and therefore has the same geometric interpretation as Gauss–Seidel applied to the dual problem. Gaffke and Mathar (1989) [27] showed the relation of the Dykstra algorithm to the method of component-wise cyclic minimization over a product space, also proposing a fully simultaneous Dykstra algorithm. The only works devoted to giving some insight into a more efficient implementation are those of Ruud. Goldman and Ruud (1993) [9] generalized the method of Hildreth, showing that there is no need to restrict the iterations to one element of λ at a time: one can optimize over subsets and/or change the order in which the elements are taken. This observation is important for the speed of convergence, since slow convergence can be understood as a symptom of near multicollinearity among restrictions: because the intermediate projections are so close to one another, the algorithm makes small incremental steps towards the solution. They also remarked that Dykstra uses a parametrization in the primal parameter space, which causes numerical round-off errors in the residual variable. These round-off errors accumulate, so that the fitted value does not satisfy the constraints of the dual problem. It would be better to use a parametrization in the dual space, so that the dual constraints would be satisfied at each iteration. Later, Ruud [31] proved that the contraction property of the proposed generalizations rests solely on the requirement that every constraint appears in at least one subproblem of an iteration. As one approaches the solution, constraints that are satisfied at the solution are eliminated; removing satisfied constraints accelerates the Hildreth procedure. The authors propose to reduce the set of active constraints, that is, the constraints satisfied as equations at the corresponding points, by removing as many constraints as possible through periodic optimization over all positive elements of λ.
where the matrix A is assumed to be irreducible (AA^T = νI, ν > 0) and the intersection of the relative interiors of the domains of the two functions is assumed to be nonempty (ri dom(g) ∩ ri dom(f) ≠ ∅). ADMM minimizes the augmented Lagrangian L over the two variables of the problem, say x and z, first over x with z fixed, then over z with x fixed, and then applies a proximal maximization step with respect to the Lagrange multiplier λ. The augmented Lagrangian of index γ ∈ ]0, +∞[ is
$$L_\gamma(x, z, \lambda) = f(x) + g(z) + \frac{1}{\gamma}\lambda^T(Ax - z) + \frac{1}{2\gamma}\|Ax - z\|^2 \qquad (3.5)$$
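A minimal sketch of this splitting applied to cone regression, with f(x) = ½(x−y)^T W (x−y), g the indicator of the nonpositive orthant and the coupling Ax = z, written here in the scaled form with penalty ρ (a reparametrization of (3.5)); the penalty value and iteration count are illustrative assumptions.

```python
import numpy as np

def admm_cone_regression(y, w, A, rho=1.0, n_iter=2000):
    """ADMM for min 0.5*(x-y)^T W (x-y) s.t. A x <= 0, with splitting A x = z, z <= 0."""
    W = np.diag(w)
    M = W + rho * A.T @ A                 # x-update system matrix (constant across iterations)
    z = np.zeros(A.shape[0])
    u = np.zeros(A.shape[0])              # scaled dual variable
    x = y.copy()
    for _ in range(n_iter):
        x = np.linalg.solve(M, W @ y + rho * A.T @ (z - u))
        z = np.minimum(0.0, A @ x + u)    # projection onto {z <= 0}
        u = u + A @ x - z                 # dual update
    return x
```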
All algorithms that will be reviewed in this section are active set methods resting on the LCP formulation (2.15). Active set methods work by choosing a subset of indices j ∈ J̃ ⊂ J = {1, ..., n} for which w_j is allowed to be non-zero while the corresponding λ_j is forced to be zero, whereas for the remaining indices j ∈ J \ J̃, w_j is forced to be zero and λ_j is allowed to take nonzero values. In this section we review active set algorithms such as the mixed primal-dual basis algorithm (section 3.2.3), the critical index algorithm (section 3.2.4) and Meyer's algorithm (section 3.2.5).
Before detailing the algorithms, we introduce some definitions and basic results about the geometry of polyhedral convex cones, on which the algorithms presented in this section are based. For further details the reader is referred to [38].
Lemma 1 establishes the relationship between the constraint matrix A and the
edges of the polar cone {γ i , i = 1, ..., m}, namely AT = [γ 1 , ..., γ m ]. We would
like now to determine the edges of the constraint cone K.
Let the vectors {γ m+1 , .., γ n } be orthogonal to the vectors {γ i , i = 1, .., m}
and orthonormal to each other so that the set {γ i , i = 1, .., n} forms a basis for
Rn . By defining the dual basis of the basis {γ i , i = 1, .., n} as the set of vectors
{β i , i = 1, ..., n} that verify the relationship
$$(\beta^i)^T \gamma^j = \begin{cases} -1 & i = j \\ 0 & i \ne j \end{cases} \qquad (3.6)$$
the constraint cone K = {x : Ax ≤ 0} can be equivalently written as
$$K = \Big\{ x : x = \sum_{i=1}^{m} b_i \beta^i + \sum_{i=m+1}^{n} b_i \beta^i, \;\; b_i \ge 0, \; i = 1, \dots, m \Big\} \qquad (3.7)$$
To see this, let B = [β^1, ..., β^n] and C = [γ^1, ..., γ^n]. Then Ax gives the first m coordinates of C^T x. Since B^T C = −I_n by construction, we have C^T = −B^{-1} and hence C^T x = −B^{-1} x; therefore C^T x gives the negatives of the coordinates of x in the basis {β^i, i = 1, ..., n}. Furthermore, points in K have their first m coordinates non-negative and can be written as x = Σ_{i=1}^{n} b_i β^i, with b_i ≥ 0 for i = 1, ..., m. Taking into account Def. 2, (3.7) establishes that the vectors β^i are the edges of the constraint cone K.
the sets
$$F_J = \Big\{ x \in \mathbb{R}^n : x = \sum_{j \in J} b_j \beta^j, \; b_j > 0 \Big\}$$
form a partition of K.
Denoting by u and v the projections of y onto span(K) and span({γ^{m+1}, ..., γ^n}) respectively, the two subspaces being orthogonal, y can be written as y = u + v.
Fig 2. The point x_1 ∈ K belongs to the open face F_J = {x ∈ K : x = bβ^1, b > 0}, with J = {1}. The support cone of K at x_1 is L_K(x_1) = {x : (γ^2)^T x ≤ 0} and its dual is L_K^o(x_1) = {x : x = cγ^2, c ≥ 0}. The set of points that project onto x_1 is Π_K^{-1}(x_1) = {x_1 + L_K^o(x_1)} = {x_1 + cγ^2, c ≥ 0}. The point x_3 ∈ K belongs to the open face F_J = {0}, with J = ∅. The support cone of K at x_3 is K and its dual is K^o, so that the set of points that project onto x_3 is {K^o}. The point x_4 ∈ K belongs to the open face F_J = {x ∈ K : x = Σ_{i=1,2} b_i β^i, b_i > 0}, with J = {1, 2}. The support cone of K at x_4 is the whole space and its dual is the origin, so that the set of points that project onto x_4 is {x_4}.
Then any point in span(K) projects onto a unique point in K and belongs to a unique non-negative orthant, or sector, S_J:
$$S_J = \Big\{ x \in \mathbb{R}^n : x = \sum_{j \in J} b_j \beta^j + \sum_{j \notin J} c_j \gamma^j, \;\; b_j > 0, \; c_j \ge 0 \Big\} \qquad (3.8)$$
This approach traces back to the work of Wilhelmsen [13] in 1976. Wilhelmsen assumes that the m generators β^i of the cone K = {x ∈ R^n : x = Σ_{i=1}^{m} b_i β^i, b_i > 0} are known and proposes an iterative procedure to compute the projection of a data point onto K.
Fraser and Massam [16] proposed an iterative algorithm to solve the general problem of projecting a data point y ∈ R^n onto a polyhedral convex cone K ⊂ R^n generated by m ≤ n linear inequality constraints. Polyhedral convex cones generated by a number of inequalities at least equal to the dimension of the space they belong to have been the subject of section 3.2.1. As seen there, the problem of projecting a data point onto this class of cones reduces to finding the set of edges, or generators, of the cone, indexed by Ĵ ⊆ {1, ..., n}, such that the sector S_Ĵ contains the data point.
To this end, the set of edges of the polar cone {γ^i, i = 1, ..., m} is completed by n − m vectors orthogonal to {γ^i, i = 1, ..., m} and orthonormal to each other. In the case of concave regression m = n − 2, and the set is completed by a constant function γ^{m+1} = 1/||1||, where 1 is the m-dimensional unitary vector, and by a linear function γ^{m+2} = (x − x̄1)/||x − x̄1||, where x = (x_1, ..., x_m) and x̄ = Σ_{i=1}^{m} x_i / m. The set of vectors {γ^i, i = 1, ..., n} forms a basis for R^n.
Let the vectors {β^i, i = 1, ..., n} be the dual basis of the basis {γ^i, i = 1, ..., n} as defined in (3.6). Fraser and Massam called the vectors β^i and γ^i primal and dual vectors respectively. A primal-dual basis for R^n, associated with the set of indices J ⊆ {1, ..., n} ≡ L, is a basis B_J = [α^1, ..., α^n] made up of a subset of the primal basis vectors {β^i}_{i∈J} and the complementary subset of the dual basis vectors {γ^i}_{i∈L\J}. For n = m the primal basis vectors, corresponding to the edges of K, are simply the columns of −(A^T)^{-1}. Using the above definitions, the problem of projecting a point y ∈ R^n onto the cone K can be formulated as follows.
Theorem 3.1. The primal constrained quadratic minimization problem (2.7) is equivalent to the problem of finding
$$\operatorname*{argmin}_{x \in K} \; \|u - x\|^2 \qquad (3.9)$$
Finding the sector containing u is achieved by moving along a fixed line joining an arbitrarily chosen initial point x^0, inside the cone or on its boundary, to the data point u. By moving along this fixed line, many sectors are crossed. Each time a sector is crossed, the successive approximation x^k is obtained by projecting the point on the line passing through u and x^0 onto the face F_{J^k} of K (see Lemma 2), so that the distance ||u − x^k|| is decreasing in k. At each iteration each basis differs from the previous one by one vector only, and therefore the coordinates of x^k in the new basis are easy to calculate: only one coordinate becomes equal to zero, and there is no need to estimate the inverse of a matrix from scratch at each step. For this reason this algorithm is faster than the one of Wilhelmsen.
Points belonging to the sector S_{J^k} have non-negative coordinates in the mixed primal-dual basis B_{J^k} relative to the cone K. Therefore the procedure terminates when the coordinates of the data point in the primal-dual basis B_{J^k} are all non-negative, meaning that the point x^k is on the face of the sector containing the data point u. The number of iterations needed is equal to the number of different sectors that the line joining the initial point to the data point has to cross. This number is bounded above by 2^n. It is worth remarking that crossing a sector corresponds to a pivot step in the equivalent LCP: the net effect of a pivot step is to move from the point x^k contained in the face F_{J^k} of K to the point x^{k+1} contained in the face F_{J^{k+1}} of K.
Ten years later, Meyer [17] generalized the algorithm of Fraser and Massam
to the case of more constraints than dimensions, that is when m > n.
Murty and Fathi (1982) [19] considered the general problem of projecting a given vector y ∈ R^n onto a simplicial cone K ⊂ R^n. The definitions of simplicial cone and of pos cone, on which the former is based, are as follows.

Definition 6. The pos cone generated by the vectors in Δ = {δ^i, i = 1, ..., m}, denoted pos(Δ), is the set $\{x \in \mathbb{R}^n : x = \sum_{i=1}^{m} d_i \delta^i, \; d_i \ge 0\}$.
orthogonal to γ^l and the cone pos(Γ ∪ {−γ^l}) is the direct sum of the full line
then the projection of y onto the linear hull of {x̄, γ^i} is in the relative interior of pos{x̄, γ^i}.
One general strategy for reducing the computational cost of a large-scale opti-
mization problem is to use an initial guess that is easier to calculate and close
to the optimal solution.
Within the class of algorithms with asymptotic convergence, splitting-based methods work by activating each of the convex constraints repetitively and by combining them to obtain a sequence converging to a feasible point. Since the projection point Π(y|K) is characterized by the variational inequality
$$\langle y - \Pi(y|K), \; x - \Pi(y|K) \rangle \le 0 \quad \forall x \in K,$$
the projection operator Π(·|K) is a closed contraction. Therefore the set of fixed points of Π(·|K) is exactly K. This prevents the use of an initialization point belonging to the feasible set, as well as the use of multiscale strategies, since there is no guarantee that the solution from a previous level does not belong to the feasible set.
The same difficulty arises when considering interior point algorithms, since they need to be initialized at an interior point. In [9], Goldman proved that Dykstra's algorithm can potentially start from better starting values than the given data point y. The author established convergence to the nearest point of the primal cone from an arbitrary point in the intersection of C and the ball of radius ||y||, where C = {x | x = y − A^T λ, λ ≥ 0} is a rotation by π radians of the dual cone K^o with its vertex translated to y. A point satisfying these conditions can be obtained efficiently by using distance reduction operations based on Theorem 3.8. It is worthwhile to remark that this result can easily be interpreted in the active set framework: Dykstra's algorithm can be understood as a primal active set method and its solution is a primal-dual feasible point. Therefore, any dual feasible point, that is, every point belonging to the set C, can be used as initialization.
All algorithms with finite-time convergence detailed in the previous section are primal-dual, and they reduce the problem of projecting a given data point onto a convex set to the problem of finding the set of indices corresponding to the constraints not saturated at the solution. In general, they involve the selection of a subset from a collection of items, say Ĵ ⊆ {1, ..., m}. With this formulation, they potentially allow us to take advantage of a good estimate of the optimal active set. However, the adaptation is not rapid, since the active set estimate is updated through one-by-one changes, which prevents this class of methods from being effective general-purpose solvers for large-scale problems.
In the algorithm of Fraser and Massam, the successive approximations are obtained by moving along a fixed line connecting the initial guess to the data point. The number of iterations needed equals the number of sectors that have to be crossed to reach the sector containing the data point, so an initial guess close to the data point reduces this number. By contrast, in the algorithm of Meyer the proximity of the initial guess to the data point is not a good selection criterion. In fact, given an initial guess, the solution is attained by adding and/or removing one index at a time until an optimal solution is found. Taking into account that the optimal solution can be interpreted as the best approximation with the biggest possible number of hinges, if the optimal solution contains just a few hinges, then using the empty set as initial guess would be much faster than using the full set of possible hinges. On the other hand, if just a few constraints are satisfied with equality at the optimal solution, then the full set of indices will be a much better initial guess than the empty set. Therefore, even if the choice of the initial guess may highly influence the performance, it depends on the data and there is no well-established criterion to fix it.
The algorithm of Murty and Fathi reduces the size of the problem each time a critical index is found. Therefore it is not compatible with strategies that take advantage of a good initial estimate, since a good estimate does not lead to finding a critical index faster.
To overcome the limitation of active set methods in taking advantage of a good estimate, Curtis et al. [45] have recently proposed a heuristic framework that allows for multiple simultaneous changes in the active-set estimate, which often leads to a rapid identification of the optimal set. However, there is no guarantee of a computational advantage for general problems and, furthermore, the authors recommend their approach for generic quadratic programming problems with many degrees of freedom, which is not the case for general concave regression problems.
Another strategy to deal with large projection problems would be to build and evaluate the solution incrementally according to the order of its input, as done in online methodologies developed for dynamic optimization [46] over a stream of input. Even if the input is given in advance, inputs are processed sequentially and the algorithm must respond in real time with no knowledge of future input. Each new input may cause constraints to be added or dropped, and the final solution is a sequence of feasible solutions, one for each time step, such that later solutions build incrementally on earlier ones.
Of course, this strategy requires that the algorithm respond in real time to each new input, which would not be possible when dealing with large matrix inverse computations. Let x̂ ∈ K_n be the projection of y ∈ R^n onto K_n and let $\hat{\bar{x}} \in K_{n+1}$ be the projection of ȳ ∈ R^{n+1} onto K_{n+1}. When a new element is
added to the data point, a new constraint is added too, so that the constraint
cone has a new edge. If this constraint corresponds to a critical index, that is to
a constraint satisfied at equality in the optimal solution, then the projection face
will be the same so that no further computing will be needed. On the contrary,
if the new constraint does not correspond to a critical index, the projection
face will change, including the edge corresponding to the new constraint and
removing and adding some others. Therefore, the major difficulty faced by an
online strategy is the same faced in exploiting good estimates.
slightly from the matrix of the previous iteration. What ’slightly’ exactly means
depends on the specific algorithm and it is detailed in the following.
The algorithm of Fraser and Massam involves the computation of an n × n fixed-size matrix inverse at each iteration. The matrix to be inverted differs from the matrix of the previous iteration by only one column, the change being of the form A → A + uv^T, where u is a unit vector with a single nonzero component corresponding to the index of the column to be changed, and v corresponds to the difference between the elements of the dual basis vector and the elements of the primal basis vector, or vice versa. In this case the inverse can be updated efficiently by using the Sherman–Morrison formula:
$$(A + uv^T)^{-1} = A^{-1} - \frac{z w^T}{1 + \lambda}, \quad \text{where } z = A^{-1}u, \;\; w = (A^{-1})^T v, \;\; \lambda = v^T z.$$
Therefore only two matrix-vector products, an outer product and a scalar product are needed to update the inverse at each step.
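A minimal sketch of this rank-one update (the function name and the commented check are illustrative):

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, assuming 1 + v^T A^{-1} u != 0."""
    z = A_inv @ u
    w = A_inv.T @ v
    lam = v @ z
    return A_inv - np.outer(z, w) / (1.0 + lam)

# Quick check against a direct inversion:
# A = np.random.rand(5, 5) + 5 * np.eye(5); u, v = np.random.rand(5), np.random.rand(5)
# assert np.allclose(sherman_morrison_update(np.linalg.inv(A), u, v),
#                    np.linalg.inv(A + np.outer(u, v)))
```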
The algorithm of Meyer, as well as the one of Liu and Fathi [20], involves the computation of an inverse matrix of variable size at each iteration. The matrix to be inverted differs from the matrix of the previous iteration by only one column, which has been added or removed. Since the matrix to be inverted, A(J) ∈ R^{r×n} with r ≤ m, is generally rectangular, the Moore–Penrose generalized inverse, or pseudoinverse, is computed: A(J)^† = A(J)^T (A(J)A(J)^T)^{-1}.
In Matlab and Scilab, the computation of the pseudoinverse is based on the Singular Value Decomposition (SVD), and singular values lower than a tolerance are treated as zero. The advantage of this approach is that the pseudoinverse can be computed even for a singular matrix. However, the method proposed by Liu and Fathi [20] to improve the computational efficiency of their algorithm does not take advantage of the SVD approach, since it consists in updating the matrix (A(J)A(J)^T)^{-1}. If the matrix A(J) is ill-conditioned, the inverse cannot be computed with good accuracy, and for the matrix A(J)A(J)^T the situation is even worse, because this operation squares the condition number of A(J).
A better solution would be to update the pseudoinverse directly. This can be achieved, when a column is added, by using the method proposed in [47]. Let A^T ∈ R^{n×r}, x ∈ R^n and B = [A^T  x] ∈ R^{n×(r+1)}. The pseudoinverse of B can be computed from the pseudoinverse of A^T as follows:
$$B^\dagger = \begin{bmatrix} (A^T)^\dagger - (A^T)^\dagger x\, w^\dagger \\ w^\dagger \end{bmatrix}, \quad \text{where } w = \big(I - A^T (A^T)^\dagger\big)x \;\text{ and }\; w^\dagger = \frac{w^T}{\|w\|^2}.$$
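A minimal sketch of this column-append update, restricted to the case w ≠ 0 (the function name and the commented check are illustrative):

```python
import numpy as np

def pinv_append_column(M, M_pinv, x):
    """Pseudoinverse of [M x] from the pseudoinverse of M, assuming x is not
    in the column space of M (so that w != 0)."""
    w = x - M @ (M_pinv @ x)              # component of x outside range(M)
    w_pinv = w / (w @ w)                  # w^T / ||w||^2, stored as a vector
    top = M_pinv - np.outer(M_pinv @ x, w_pinv)
    return np.vstack([top, w_pinv])

# Consistency check:
# M = np.random.rand(8, 3); x = np.random.rand(8)
# B = np.column_stack([M, x])
# assert np.allclose(pinv_append_column(M, np.linalg.pinv(M), x), np.linalg.pinv(B))
```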
SVD decomposition. The two main operations on which his method is based are the full-rank Cholesky factorization of A^T A and the inversion of L^T L, where L is a full-rank matrix. On a serial processor these computations are of complexity order O(n^3) and O(m^3) respectively. However, on a parallel architecture with as many processors as necessary, the time complexity of the Cholesky factorization of A^T A could reduce to O(n), while the time complexity of the inversion of the symmetric positive definite matrix L^T L could reduce to O(log(r)). However, the computational advantage of this method can be appreciated only when r ≪ n, since the inverse of an r × r matrix has to be computed, which is not the case in general, especially for concave regression problems.
Zhu and Li [51], addressing recursive constrained least squares problems, found that the exact solution of linear equality-constrained least squares can be obtained by the same recursion as for the unconstrained problem, provided that the restricted least squares procedure is appropriately initialized. However, even this approach offers significant advantages in terms of computational cost only when the number of constraints m is small, which is not the case for large-scale concave regression problems.
An active set algorithm for solving the concave regression problem generates a sequence of quasi-stationary points. A primal (dual) feasible active set algorithm generates a sequence of primal (dual) feasible quasi-stationary points with decreasing objective function values and terminates when dual (primal) feasibility is satisfied. An active set J induces a unique partition of the indices {1, ..., n} into blocks. A block B of such a partition is a set of consecutive integers, say {p, p+1, ..., q}, such that the index i of the constraint x_{i+2} − x_{i+1} ≤ x_{i+1} − x_i is in J for each i with p ≤ i ≤ q − 2. Conversely, any such partition of the indices determines a unique active set. Denoting by λ_i the multiplier associated with the i-th constraint, the KKT optimality conditions can be written as:
$$\begin{cases}
x_{i+2} - x_{i+1} \le x_{i+1} - x_i, & i = 1, \dots, n-2 \\
y_1 - x_1 = \lambda_1 \\
y_2 - x_2 = -2\lambda_1 + \lambda_2 \\
y_3 - x_3 = \lambda_1 - 2\lambda_2 + \lambda_3 \\
\;\;\vdots \\
y_{n-2} - x_{n-2} = \lambda_{n-4} - 2\lambda_{n-3} + \lambda_{n-2} \\
y_{n-1} - x_{n-1} = \lambda_{n-3} - 2\lambda_{n-2} \\
y_n - x_n = \lambda_{n-2} \\
\lambda_i \ge 0, & i = 1, \dots, n-2 \\
\lambda_i (2x_{i+1} - x_i - x_{i+2}) = 0, & i = 1, \dots, n-2
\end{cases}$$
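For equally spaced design points these conditions can be verified numerically; the sketch below, with an illustrative function name and tolerance, checks a candidate pair (x, λ).

```python
import numpy as np

def check_kkt(y, x, lam, tol=1e-8):
    """Verify the KKT conditions for concave regression with unit spacing,
    where lam holds the multipliers of the n-2 concavity constraints."""
    n = len(y)
    d2 = x[:-2] - 2 * x[1:-1] + x[2:]          # second differences x_i - 2x_{i+1} + x_{i+2}
    A = np.zeros((n - 2, n))
    for i in range(n - 2):
        A[i, i:i + 3] = [1.0, -2.0, 1.0]
    return {
        "primal feasibility": bool(np.all(d2 <= tol)),
        "dual feasibility": bool(np.all(lam >= -tol)),
        "stationarity": bool(np.allclose(y - x, A.T @ lam, atol=tol)),
        "complementary slackness": bool(np.all(np.abs(lam * d2) <= tol)),
    }
```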
n n
It is easy to show that: i=1 i(yi − xi ) = 0 and i=1 (yi − xi ) = 0. Knowing
that each block can be represented by an affine function xi = α + βi, the above
6. Comparative results
Table 1
Qualitative comparison of all algorithms
As initialization for finite-time active set methods, we considered three different initializations of the active set J: the empty set, the set of m indices corresponding to the linear inequality constraints, and the set of non-saturated constraints obtained by using the PAV-inspired algorithm described in the previous section.
In the class of algorithms with asymptotic convergence, the most efficient appears to be ADMM. This is evident even for very small data sizes when using difficult signals such as near-convex or very noisy signals (see Figure 5 and Figure 6 of Appendix A). For noisy concave signals (Figure 4 of Appendix A), the computational efficiency of ADMM is more evident in the presence of a high level of noise. Already for signals of size 500, however, the performance of ADMM is not very good: convergence (SSE < 10^{-6}) is not completely attained. The algorithm of Meyer is very sensitive to the initialization. It gives good performance when the signal is Gaussian white noise and the initial active set is empty, since most of the constraints are expected to be saturated at the solution (see Figure 6 and Figure 9 of Appendix A).
Given a good initialization, the MPDB algorithm computes the exact solution faster than the other methods. However, this algorithm is not numerically stable, since it requires computing the inverse of a matrix at each iteration. As explained in section 4.3, this is done incrementally by using the Sherman–Morrison formula. However, numerical round-off errors accumulate, so that in our implementation the exact inverse is recomputed every 150 iterations. As can be observed in Figure 10, Figure 11 and Figure 12 of Appendix A, which refer to signals of size 1000, sometimes the round-off error dominates the calculation and the distance to the solution increases instead of decreasing.
These results demonstrate that, although there have been theoretical and practical advances in recent years, the use of shape-constrained nonparametric techniques for solving large-scale problems (more than a few thousand points) is still limited by computational issues. To deal with very large-scale problems, up to a million points, matrix inverse calculations should be avoided and more effort should be devoted to finding a way to better initialize primal-dual methods, by computing approximate solutions and exploiting multi-scale strategies.
7. Conclusions
In this paper we have stated, analyzed and compared, qualitatively and quantitatively, several optimization approaches for solving the cone regression problem. We have distinguished two broad classes of methods. On the one side, methods with asymptotic convergence, which rest on the properties of proximity operators and find the solution as the limit of a sequence of successive approximations. On the other side, methods with finite-time convergence, which exploit the geometric properties of polyhedral convex cones and find the exact solution as a non-negative linear combination of functions forming a basis of a specified finite-dimensional space.
Simulations with up to one thousand points have demonstrated that the choice of the optimization approach severely impacts algorithmic performance. In particular, it has emerged that methods with finite-time convergence are much more efficient than asymptotic-convergence methods. However, this study has also shown that they face a twofold difficulty in coping with large-scale optimization. The first difficulty arises from the fact that all algorithms of this class modify the active set estimate one element at a time, making the adaptation of the initial active set estimate very slow; the second lies in the fact that they involve the computation of the inverse or the pseudoinverse of a matrix at each variation of the active set. Although there exist many methods to do this efficiently when the matrix rank is much lower than the size of the data, this condition cannot be guaranteed in general. Incremental strategies that reduce the cost of computing the inverse of a matrix when the inverse of a slightly different matrix is known are limited by round-off error accumulation in an iterative setting.
The results of this study suggest that, to be able to treat very large-scale problems (up to a million points), further research should focus on finding a way to exploit classical multi-scale strategies and to compute an approximate solution through a penalization or splitting method without involving any matrix inverse calculation.
Appendix A
$$S_1(z) = \begin{cases} 2n \sin\!\big(\tfrac{5n}{24}\, z\big) & z = 1, \dots, \tfrac{n}{3} \\[2pt] \alpha + \beta z & z = \tfrac{n}{3}+1, \dots, \tfrac{2n}{3} \\[2pt] \gamma z^3 + \delta & z = \tfrac{2n}{3}+1, \dots, n \end{cases}$$
where $\beta = \tfrac{1}{10}$, $\alpha = 2n\sin\!\big(\tfrac{8}{5}\big) - \beta n$, $\gamma = -\tfrac{2}{3n^2}$ and $\delta = \alpha + \beta\,\tfrac{2n}{3} - \gamma\,\tfrac{8n^3}{27}$.
$$S_2(z) \sim \mathcal{N}(\mu, \sigma^2)$$
$$S_3(z) = \operatorname{sinc}\!\big(\tfrac{6}{n}\, z - 1\big)$$
Fig 3. Signals S1 (top left) and S2 (top right) to which white noise with three different values of standard deviation, σ = 0.01, σ = 0.1 and σ = 0.5, has been added.
Fig 4. Distance to the solution (L2 norm) for a signal of type S1 of size 50. From top to bottom, three increasing levels of noise have been added. (Left) Algorithms with finite-time convergence: all of them converge instantaneously. (Right) Algorithms with asymptotic convergence: ADMM is more robust to noise. LSPS and Dykstra use a parametrization in the primal parameter space, which causes numerical round-off errors to accumulate, so that the fitted values do not satisfy the constraints of the dual problem and convergence is not attained.
Fig 5. Distance to the solution (L2 norm) for a signal of type S2 of size 50. From top to bottom, three increasing levels of noise have been added. (Left) Algorithms with finite-time convergence: all of them converge instantaneously. (Right) Algorithms with asymptotic convergence: both ADMM and Hildreth attain the solution when no noise is added, but ADMM is much more robust to noise.
Fig 6. Distance to the solution (L2 norm) for Gaussian white noise signals of size 50. (Left) Algorithms with finite-time convergence: all of them converge instantaneously. (Right) Algorithms with asymptotic convergence: ADMM is the only algorithm that converges.
Fig 7. Distance to the solution (L2 norm) for a signal of type S1 of size 500. From top to bottom, three increasing levels of noise have been added. (Left) Algorithms with finite-time convergence: Meyer's algorithm is very sensitive to the initialization. The best performance is achieved by MPDB with a PAV-inspired approximate initialization. (Right) Algorithms with asymptotic convergence: ADMM converges only when the signal is slightly noisy.
Fig 8. Distance to the solution (L2 norm) for a signal of type S2 of size 500. From top to bottom, three increasing levels of noise have been added. (Left) Algorithms with finite-time convergence: the best performance is achieved by MPDB with a PAV-inspired initialization. (Right) Algorithms with asymptotic convergence: no algorithm converges.
Fig 9. Distance to the solution (L2 norm) for a Gaussian white noise signal of size 500. (Left) Algorithms with finite-time convergence: the best performance is achieved by MPDB with a PAV-inspired initialization. (Right) Algorithms with asymptotic convergence: no algorithm converges.
Fig 10. Distance to the solution (L2 norm) for a signal of type S1 of size 1000. From top to bottom, three increasing levels of noise have been added. (Left) Algorithms with finite-time convergence: the best performance is achieved by MPDB with a PAV-inspired initialization. (Right) Algorithms with asymptotic convergence: no algorithm converges.
Fig 11. Distance to the solution (L2 norm) for a signal of type S2 of size 1000. From top to bottom, three increasing levels of noise have been added. (Left) Algorithms with finite-time convergence: the best performance is achieved by MPDB with a PAV-inspired initialization. (Right) Algorithms with asymptotic convergence: no algorithm converges.
Fig 12. Distance to the solution (L2 norm) for a signal of type S3 of size 1000. (Left) Algorithms with finite-time convergence: the best performance is achieved by MPDB with a PAV-inspired initialization when the round-off error does not dominate the calculation, and by Meyer's algorithm with an empty initial active set. This is easy to understand since, for a pure noise signal, a high number of degrees of freedom is expected at the solution and therefore the empty active set is close to the final active set. (Right) Algorithms with asymptotic convergence: no algorithm converges.
Basic notations:
y is the input signal
N is the iteration number (when required by the algorithm)
x̂ is the output signal
x̂ ← y − A^T λ^N
end
end
$$\operatorname{prox}_{A_i} z = \operatorname{prox}_{\{x \,:\, A_i x \le 0\}} z = z - \frac{\max(0, A_i z)}{\|A_i\|^2}\, A_i^T$$
Algorithm 5: Uzawa
Data: y ∈ R^n, A ∈ R^{m×n}, maximum number of iterations N
Result: x̂ = argmin ||y − x||^2 subject to Ax ≤ 0, which is equivalent to (3.3)
begin
  ρ ≥ 0
  k ← 1
  for k = 1 : N do
    x^k = prox_g y
    μ^{k+1} = max(0, μ^k + ρ A x^k)
  end
end
g(x) = ⟨μ^k, Ax⟩
prox_g y = argmin_x ½ ||x − y||^2 + g(x)
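A runnable transcription of Algorithm 5 in Python, using the closed form prox_g y = y − A^T μ^k; the step size and iteration count are illustrative assumptions.

```python
import numpy as np

def uzawa(y, A, rho=0.01, n_iter=5000):
    """Uzawa iteration for min ||y - x||^2 s.t. A x <= 0."""
    mu = np.zeros(A.shape[0])
    x = y.copy()
    for _ in range(n_iter):
        x = y - A.T @ mu                        # x^k = prox_g y with g(x) = <mu, A x>
        mu = np.maximum(0.0, mu + rho * A @ x)  # projected gradient ascent on the dual
    return x
```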
x̂ ← Δ(J) Δ(J)^† u + v
end
Δ(J) = {δ^j, j ∈ J}
x̂ ← y − θ
end
A(J) = {A_j, j ∈ J}
end
Γ(J) = {Γj , j ∈ J}
Acknowledgements
The author would like to thank Lionel Moisan for insightful suggestions. This work has been carried out during a postdoctoral stay in the Laboratory of Applied Mathematics (MAP5, CNRS UMR 8145) at Paris Descartes University.
Supplementary Material
References