Commit 492e1ec

ENH add gap safe screening rules to enet_coordinate_descent (scikit-learn#31882)
1 parent 884e512 commit 492e1ec

6 files changed: +359 -92 lines changed

doc/modules/linear_model.rst

Lines changed: 81 additions & 14 deletions
@@ -233,24 +233,23 @@ Cross-Validation.
 Lasso
 =====

-The :class:`Lasso` is a linear model that estimates sparse coefficients.
+The :class:`Lasso` is a linear model that estimates sparse coefficients, i.e., it is
+able to set coefficients exactly to zero.
 It is useful in some contexts due to its tendency to prefer solutions
 with fewer non-zero coefficients, effectively reducing the number of
 features upon which the given solution is dependent. For this reason,
 Lasso and its variants are fundamental to the field of compressed sensing.
-Under certain conditions, it can recover the exact set of non-zero
-coefficients (see
+Under certain conditions, it can recover the exact set of non-zero coefficients (see
 :ref:`sphx_glr_auto_examples_applications_plot_tomography_l1_reconstruction.py`).

 Mathematically, it consists of a linear model with an added regularization term.
 The objective function to minimize is:

-.. math:: \min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}
+.. math:: \min_{w} P(w) = {\frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}

-The lasso estimate thus solves the minimization of the
-least-squares penalty with :math:`\alpha ||w||_1` added, where
-:math:`\alpha` is a constant and :math:`||w||_1` is the :math:`\ell_1`-norm of
-the coefficient vector.
+The lasso estimate thus solves the least-squares with added penalty
+:math:`\alpha ||w||_1`, where :math:`\alpha` is a constant and :math:`||w||_1` is the
+:math:`\ell_1`-norm of the coefficient vector.

 The implementation in the class :class:`Lasso` uses coordinate descent as
 the algorithm to fit the coefficients. See :ref:`least_angle_regression`
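To make the sparsity and the primal objective :math:`P(w)` above concrete, here is a
minimal sketch; the synthetic data, `alpha=0.1` and `fit_intercept=False` are arbitrary
choices for illustration and not part of this commit::

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    n_samples, n_features = 50, 100
    X = rng.randn(n_samples, n_features)
    w_true = np.zeros(n_features)
    w_true[:5] = rng.randn(5)          # only 5 features influence the target
    y = X @ w_true + 0.01 * rng.randn(n_samples)

    alpha = 0.1
    lasso = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
    w = lasso.coef_
    print("non-zero coefficients:", np.count_nonzero(w))

    # Primal objective P(w) = ||X w - y||^2 / (2 n_samples) + alpha * ||w||_1
    P = ((X @ w - y) ** 2).sum() / (2 * n_samples) + alpha * np.abs(w).sum()
    print("P(w) =", P)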
@@ -281,18 +280,86 @@ computes the coefficients along the full path of possible values.

 .. dropdown:: References

-   The following two references explain the iterations
-   used in the coordinate descent solver of scikit-learn, as well as
-   the duality gap computation used for convergence control.
+   The following references explain the origin of the Lasso as well as properties
+   of the Lasso problem and the duality gap computation used for convergence control.

-   * "Regularization Path For Generalized linear Models by Coordinate Descent",
-     Friedman, Hastie & Tibshirani, J Stat Softw, 2010 (`Paper
-     <https://www.jstatsoft.org/article/view/v033i01/v33i01.pdf>`__).
+   * :doi:`Robert Tibshirani. (1996) Regression Shrinkage and Selection Via the Lasso.
+     J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267-288
+     <10.1111/j.2517-6161.1996.tb02080.x>`
    * "An Interior-Point Method for Large-Scale L1-Regularized Least Squares,"
      S. J. Kim, K. Koh, M. Lustig, S. Boyd and D. Gorinevsky,
      in IEEE Journal of Selected Topics in Signal Processing, 2007
      (`Paper <https://web.stanford.edu/~boyd/papers/pdf/l1_ls.pdf>`__)

+.. _coordinate_descent:
+
+Coordinate Descent with Gap Safe Screening Rules
+------------------------------------------------
+
+Coordinate descent (CD) is a strategy to solve a minimization problem that considers a
+single feature :math:`j` at a time. This way, the optimization problem is reduced to a
+1-dimensional problem which is easier to solve:
+
+.. math:: \min_{w_j} {\frac{1}{2n_{\text{samples}}} ||x_j w_j + X_{-j}w_{-j} - y||_2 ^ 2 + \alpha |w_j|}
+
+with index :math:`-j` meaning all features but :math:`j`. The solution is
+
+.. math:: w_j = \frac{S(x_j^T (y - X_{-j}w_{-j}), n_{\text{samples}}\alpha)}{||x_j||_2^2}
+
+with the soft-thresholding function
+:math:`S(z, \alpha) = \operatorname{sign}(z) \max(0, |z|-\alpha)`.
+Note that the soft-thresholding function is exactly zero whenever
+:math:`\alpha \geq |z|`.
+The CD solver then loops over the features either in a cycle, picking one feature after
+the other in the order given by `X` (`selection="cyclic"`), or by randomly picking
+features (`selection="random"`).
+It stops if the duality gap is smaller than the provided tolerance `tol`.
+
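The coordinate update above translates almost directly into NumPy. The following is a
rough, unoptimized sketch for illustration only; `soft_threshold` and `lasso_cd` are
made-up names, and the actual solver is the Cython routine `enet_coordinate_descent`,
which additionally handles the elastic net penalty, random feature selection and the
duality-gap stopping criterion::

    import numpy as np

    def soft_threshold(z, t):
        # S(z, t) = sign(z) * max(0, |z| - t)
        return np.sign(z) * np.maximum(0.0, np.abs(z) - t)

    def lasso_cd(X, y, alpha, n_iter=100):
        """Cyclic coordinate descent for
        P(w) = ||X w - y||^2 / (2 n_samples) + alpha * ||w||_1."""
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        norm_cols = (X ** 2).sum(axis=0)      # ||x_j||_2^2
        residual = y - X @ w                  # r = y - X w
        for _ in range(n_iter):
            for j in range(n_features):
                if norm_cols[j] == 0.0:
                    continue
                # x_j^T (y - X_{-j} w_{-j}) = x_j^T r + ||x_j||^2 w_j
                rho = X[:, j] @ residual + norm_cols[j] * w[j]
                w_j_new = soft_threshold(rho, n_samples * alpha) / norm_cols[j]
                residual += X[:, j] * (w[j] - w_j_new)
                w[j] = w_j_new
        return w

Given enough passes on small dense toy data, `lasso_cd(X, y, alpha)` should come out
close to `Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_`.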
+.. dropdown:: Mathematical details
+
+   The duality gap :math:`G(w, v)` is an upper bound of the difference between the
+   current primal objective function of the Lasso, :math:`P(w)`, and its minimum
+   :math:`P(w^\star)`, i.e. :math:`P(w) - P(w^\star) \leq G(w, v)`. It is given by
+   :math:`G(w, v) = P(w) - D(v)` with dual objective function
+
+   .. math:: D(v) = \frac{1}{2n_{\text{samples}}}(2y^Tv - ||v||_2^2)
+
+   subject to :math:`||X^Tv||_{\infty} \leq n_{\text{samples}}\alpha`.
+   With (scaled) dual variable :math:`v = c r`, current residual :math:`r = y - Xw` and
+   dual scaling
+
+   .. math::
+      c = \begin{cases}
+      1, & ||X^Tr||_{\infty} \leq n_{\text{samples}}\alpha, \\
+      \frac{n_{\text{samples}}\alpha}{||X^Tr||_{\infty}}, & \text{otherwise}
+      \end{cases}
+
+   the stopping criterion is
+
+   .. math:: G(w, cr) < \text{tol} \cdot \frac{||y||_2^2}{n_{\text{samples}}}\,.
+
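A small NumPy sketch of the gap computation and the scaled stopping check, using the
same scaling as the formulas above; `dual_gap` is an illustrative name, not a
scikit-learn function::

    import numpy as np

    def dual_gap(X, y, w, alpha):
        """Duality gap G(w, c r) under the scaling used above."""
        n_samples = X.shape[0]
        r = y - X @ w
        primal = (r @ r) / (2 * n_samples) + alpha * np.abs(w).sum()
        # Dual scaling c makes v = c r feasible, i.e. ||X^T v||_inf <= n_samples * alpha.
        dual_norm = np.max(np.abs(X.T @ r))
        c = 1.0 if dual_norm <= n_samples * alpha else n_samples * alpha / dual_norm
        v = c * r
        dual = (2 * (y @ v) - v @ v) / (2 * n_samples)
        return primal - dual

    rng = np.random.RandomState(0)
    X, y = rng.randn(20, 50), rng.randn(20)
    w, alpha, tol = np.zeros(50), 0.5, 1e-4
    gap = dual_gap(X, y, w, alpha)
    # Stopping check as in the criterion above.
    print("converged:", gap < tol * (y @ y) / X.shape[0])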
+A clever method to speed up the coordinate descent algorithm is to screen out features
+:math:`j` for which :math:`w_j = 0` at the optimum. Gap safe screening rules are such a
+tool. At any point during the optimization algorithm, they can tell which features can
+safely be excluded, i.e., set to zero with certainty.
+
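A rough sketch of the sphere test behind these rules, translated from the references
below to the scaling used above: a feature can be discarded if
:math:`|x_j^T v| / n_{\text{samples}} + ||x_j||_2 \sqrt{2 G / n_{\text{samples}}} < \alpha`.
Here `gap_safe_screen` is an illustrative name and not scikit-learn's internal
implementation::

    import numpy as np

    def gap_safe_screen(X, y, w, alpha):
        """Boolean mask of features that the sphere test allows to set to zero."""
        n_samples = X.shape[0]
        r = y - X @ w
        dual_norm = np.max(np.abs(X.T @ r))
        c = 1.0 if dual_norm <= n_samples * alpha else n_samples * alpha / dual_norm
        v = c * r                              # feasible dual point
        primal = (r @ r) / (2 * n_samples) + alpha * np.abs(w).sum()
        dual = (2 * (y @ v) - v @ v) / (2 * n_samples)
        gap = max(primal - dual, 0.0)          # guard against tiny negative round-off
        radius = np.sqrt(2 * gap / n_samples)
        # Screen feature j if |x_j^T v| / n_samples + ||x_j||_2 * radius < alpha.
        return np.abs(X.T @ v) / n_samples + np.linalg.norm(X, axis=0) * radius < alpha

    rng = np.random.RandomState(0)
    X, y = rng.randn(40, 200), rng.randn(40)
    mask = gap_safe_screen(X, y, np.zeros(200), alpha=0.5)
    print(mask.sum(), "of 200 features can be discarded")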
+.. dropdown:: References
+
+   The first reference explains the coordinate descent solver used in scikit-learn, the
+   others treat gap safe screening rules.
+
+   * :doi:`Friedman, Hastie & Tibshirani. (2010).
+     Regularization Path For Generalized linear Models by Coordinate Descent.
+     J Stat Softw 33(1), 1-22 <10.18637/jss.v033.i01>`
+   * :arxiv:`O. Fercoq, A. Gramfort, J. Salmon. (2015).
+     Mind the duality gap: safer rules for the Lasso.
+     Proceedings of Machine Learning Research 37:333-342, 2015.
+     <1505.03410>`
+   * :arxiv:`E. Ndiaye, O. Fercoq, A. Gramfort, J. Salmon. (2017).
+     Gap Safe Screening Rules for Sparsity Enforcing Penalties.
+     Journal of Machine Learning Research 18(128):1-33, 2017.
+     <1611.05780>`
+
 Setting regularization parameter
 --------------------------------

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+- :class:`linear_model.ElasticNet`, :class:`linear_model.ElasticNetCV`,
+  :class:`linear_model.Lasso`, :class:`linear_model.LassoCV` as well as
+  :func:`linear_model.lasso_path` and :func:`linear_model.enet_path` now implement
+  gap safe screening rules in the coordinate descent solver for dense `X` and
+  `precompute=False` or `"auto"` with `n_samples < n_features`.
+  The speedup of fitting time is particularly pronounced (10-times is possible) when
+  computing regularization paths like the \*CV-variants of the above estimators do.
+  There is now an additional check of the stopping criterion before entering the main
+  loop of descent steps. As the stopping criterion requires the computation of the dual
+  gap, the screening happens whenever the dual gap is computed.
+  By :user:`Christian Lorentzen <lorentzenchr>`.
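A hypothetical setting in which the entry above applies, with dense `X`, the default
`precompute="auto"` and `n_samples < n_features`; the actual speedup depends on the
data and the scikit-learn version and is not claimed here::

    from time import perf_counter

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV

    # Dense X with n_samples < n_features, precompute left at its default "auto".
    X, y = make_regression(n_samples=200, n_features=2000, n_informative=10,
                           noise=1.0, random_state=0)

    start = perf_counter()
    model = LassoCV(cv=5).fit(X, y)   # fits a regularization path per CV split
    print(f"fit time: {perf_counter() - start:.2f}s, selected alpha: {model.alpha_:.4g}")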
